FOREWORD


This document is the Final Report of a multiprocessor
architectural design study whose objective was to
establish a baseline design for a central multiprocessor
for a Space Station Data Management System, exploiting
the NASA/MSFC-developed SUMC hardware where possible.

The study was sponsored by the NASA Marshall Space Flight
Center, Huntsville, Alabama, under contract NAS8-28605,
entitled "Research Study on Memory Hierarchy." It was
performed by Intermetrics, Inc., Cambridge, Massachusetts,
over the period June to October 1972, under the direction
of Alex L. Kosmala. Technical monitors for MSFC
were Mr. Gerald L. Turner and Mr. James L. Lewis.

Publication of this report does not constitute 
approval by NASA of the findings or conclusions contained 
therein. 
ABSTRACT


This is an architectural design study of a multiprocessor
computing system intended to meet functional and performance
specifications appropriate to a manned space station
application as defined by NASA's Marshall Space Flight Center.
Intermetrics' previous experience and accumulated knowledge of
the multiprocessor field are used to generate a baseline
philosophy for the design of a future SUMC* multiprocessor.

The operating system design problem for multiproces- 
sors is to approach the theoretical performance without sacri- 
ficing fault tolerance, flexibility, and expandability. Para- 
llel tasking is described as a necessary operating capability 
in this regard, while exclusive operators are also needed to 
avoid critical section conflicts. Synchronization, scheduling, 
and deadlock prevention are other system design features which 
are discussed, along with memory management. Treatment of the 
topics of operating system specification and structuring, and 
the use of a higher order language complete the discussion of 
multiprocessor operating systems. 

Interrupts are defined and the crucial questions of
interrupt structure, such as processor selection and response
time, are discussed. Memory hierarchy and performance are
discussed extensively, with particular attention to the design
approach which utilizes a cache memory associated with each
processor. The ability of an individual processor to approach its
theoretical maximum performance is then analyzed in terms of a
hit ratio, which is the proportion of memory requests that can
be supplied from cache only. Memory management is
envisioned as a virtual memory system implemented either through
segmentation or paging.

Addressing is discussed in terms of the various register
designs adopted by current computers and those of advanced
design. Using examples, two dimensional addressing, implicit
addressing, and the use of descriptors are described.
Implementation of a stack-oriented machine is explained, along with
the generation of an Effective Address scheme. The overall I/O
architecture set forth is based upon a Data Bus I/O to service an
advanced data bus concept and a Mass Storage I/O. The I/O
Controller design is then discussed in terms of interfaces to the
processors and to the memories, with special emphasis given to
recovery from failure.

* Space Ultra-reliable Modular Computer

A complete chapter is devoted to error detection,
fault isolation, and recovery philosophy as applied to a
multiprocessor system. The important topic of concept
verification is given careful scrutiny in terms of

a) analytical techniques and high-level computer
simulation, and

b) detailed, low-level simulation.

Finally, the report concludes with a detailed critique of SUMC's
architectural characteristics in relationship to the overall
design objectives.
Chapter 1
INTRODUCTION


1.1 Scope and Objectives


The work described in this report is the result of a
study of multiprocessing system design principles, performed
in support of the MSFC in-house multiprocessor computer
development. The initial objectives of the study were to achieve
a top-level architectural design capable of meeting the
functional and performance specifications established for the Phase
B Space Station Information Management System Central Processor,
and in doing so to exploit as much as possible the current
MSFC-developed SUMC processor design. However, during the early
phases of the study it became apparent that, in order to preserve
the value of an independently derived evaluation of
multiprocessor design features by Intermetrics, some deviation from
these objectives would be necessary. The basic philosophies of
multiprocessor design and operation espoused by Intermetrics in
defining an architecture appropriate to the Space Station
requirements were found to be incompatible with those already
adopted by MSFC in arriving at the current SUMC design. Consequently,
it was mutually agreed that rather than using the existing SUMC
design as the basis for the study, Intermetrics should apply the
results of their previous experience and accumulated knowledge of
the multiprocessor field to establishing a SUMC architecture
from an entirely independent point of view. Much of that point
of view was gathered in the performance of a previous design
study [1] with very similar objectives to those expressed for
the SUMC multiprocessor. Although some of the philosophies
which are embodied in that design were directly applicable, it
was decided not to tailor the complete design to the SUMC
application by adopting some features and discarding others.
Instead, it was decided to select certain multiprocessor design
areas and hardware features and perform an in-depth analysis,
review and evaluation for each, in order to establish the
philosophies and the rationale developed by Intermetrics in their
approach to a multiprocessor design. The objective was to
provide a baseline philosophy for the design of a future version
of the SUMC multiprocessor, radically different from the one
proposed in the present MSFC in-house development program.

In Intermetrics' opinion the design of a multiprocessor
for a Space Station application should be guided by the
following considerations:
a) The performance potentially achievable through the use
of multiple processors (often quoted as the main
motivation of multiprocessing but, as will be explained in
Chapter 2, very difficult to achieve) should not be
compromised by implementational incompatibilities,
especially in the executive system, nor sacrificed to
achieve other MP objectives such as fault tolerance,
flexibility, and expandability.

b) Since the overall cost of providing computational
capabilities (especially in a difficult environment like a
Space Station) may be dominated by software costs rather
than hardware, the architecture and operating
characteristics of the computer must reflect the needs, desires
and techniques of the programmer rather than those of
the logic designer.

c) The outstanding advantage of a multiprocessor is its
potential tolerance to failures of its components.
This capability should be realized in the initial
architectural design, and not provided as a final touch
after most design decisions have been made.

The detailed analyses of the areas of multiprocessor
design which were selected for this study reflect the above
basic attitude. They form most of the chapters in the remainder
of this report, and include the following topics: Operating
System Design (Chapter 2); Interrupt Structure (Chapter 3);
Memory Hierarchy (Chapter 4); Addressing (Chapter 5); I/O
Considerations (Chapter 6); Fault Tolerance (Chapter 7). Additional
chapters cover Concept Verification (Chapter 8), since it was of
some concern to MSFC how any given multiprocessor design could
be given a quantitative evaluation without incurring the initial
investment of a hardware build phase, and a critique of the SUMC
processor internal architecture (Chapter 9).

Much of the description and terminology found in this
report assumes a familiarity with Intermetrics' previous
multiprocessor design. To prevent unnecessary (and probably
inadequate) repetition of the details of that design, the reader is
referred to reference [1]. However, to provide an introduction
to at least some of the terms used, we present the following
overview of the configuration of hardware and software elements of
the design, as extracted from sections of reference [1].


1.2 Overview of Intermetrics' Multiprocessor

The basic configuration of the multiprocessor is shown
in Figure 1. The MP was specified to consist of a number of
identical, interchangeable processing elements which would execute
the major processing workload, and a single, more specialized
processor to handle I/O processing and a number of other unique
functions. These functions include interrupt handling, interprocessor
communication control, and the central timer. The executive was
specified to be non-dedicated (to any given processor), and its
functions are performed by any of the processors. The choice of
processor is made on the basis of status (e.g., having
completed its current assignment), or by reason of its greatest
interruptibility as determined by the priority of its current
process. The number of computational processors was specified
as three, because the resulting configuration represents the
simplest which possesses completely all the characteristics (and
problems) of the n-processor case. The two-processor system,
which has received the greatest amount of development and
operational experience of all configurations, represents a degenerate
form of multiprocessor: while certainly exhibiting true
concurrency of processes, the dual processor nevertheless allows
certain simplifications of executive functions to be made because
there are only two active elements in the system.

Figure 1. Intermetrics Multiprocessor Basic Configuration

INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840

The memory terminology in the figure is used in parts of this
report, and is defined as follows:

a) M1: Local memory, dedicated to, and only for use by,
a processor. This is a general term and refers
to all aspects of buffer, scratchpad, control
and associative memory required by a processing
element. The contents of any M1 storage cell are
available only to the processor of which M1 is
an intimate component. Only in case of recovery
after a P and/or M1 failure are these contents
made available to another processor. In this MP
design M1 is not, strictly, a member of the
memory hierarchy.

b) M2: Operating memory (main memory, or, in popular
terms, "core"). M2 consists of several individual
memory modules, all of which are accessible to all
processors, including the I/O controller. Each
access takes place via a data path dedicated to
each processing element, through a port in each
M2 module. The basic MP configuration, therefore,
requires four ports per M2 module. Each module is
four-way interleaved, for purposes of speed, access,
conflict resolution, and fault recovery.

c) M3: Secondary storage (backup or Mass Memory). Because
M3 is a conventional drum or disk, it was decided to
interface this level of the memory hierarchy with
the rest of the computer system in the more
conventional manner, via an I/O channel. The use of
M3 to implement the concept of virtual memory then
places the heaviest requirement on the design of
the I/O controller and the I/O executive routines.


As mentioned above, several unique functions were
gathered together into one unique module, which is (for convenience)
termed the I/O controller (IOC). All interfaces to the outside
world were handled via the IOC.


Communication between the processing elements of the MP
system (the P's and IOC) was handled by a separate
interprocessor bus (IPCB).


(It should be emphasized that the basic configuration
of Figure 1 does not indicate the levels of redundancy specified
for fault detection and/or recovery. For a discussion of these
aspects, refer to Chapter 7.)

The terminology used in this report refers to the way
in which information was organized and handled in the previous
Intermetrics work. The key terms and their assumed definitions
are as follows:


Program: An independently compilable section
of code containing pure procedures and/or data.

Procedure: A section of code to which execution control
can be passed, with or without the passage of parameters.

1) Internal, not known outside of the process (see below)

2) External, known to the name manager and declared in
the Process Information Area (see below)

Segment: A contiguous block of words defined by a
descriptor; the segment is the unit of memory management.

Process: The unit of work as recognized by the
operating system. A process is represented by a stack.

Stack: Although strictly a LIFO list, the definition
of a stack is less rigorous when used to represent a
process.

Level: A demarcation in the addressing hierarchy.
Derived from the concept of lexicographical level in
a block-structured language (such as ALGOL or HAL), but
extended to provide convenient addressing by the
operating system.
Figure 2 illustrates the relationship and use of some
of these terms. Each process is represented by an execution
stack. The initial hierarchical level for process execution,
and therefore the lowest numerical level for any process stack,
is level 2. Subsequent procedure nesting varies the lexical
level of each process stack to 3, 4, 5, etc. The portion of a
process stack that is below level 2 contains a collection of
data termed the Process Information Area (PIA), containing names,
priorities, counters for bookkeeping, etc., specific to each
process. Above the PIA the stack behaves more strictly as a
LIFO list.


Each process has associated with it a vector of
descriptors defining the segments containing the procedures to be
executed by the process. These descriptors are addressed as if
the vector were a stack: by stack number and offset from the
base of the stack. For convenience, this collection of segment
descriptors is termed Level 1, since it exists at a more global
level than the individual processes, and each such vector will
be referred to as a stack (even though, strictly, it is not).

At the most fundamental level there is a single
collection of basic system descriptors, variables, etc., which is
termed the Level 0 stack, again for convenience of addressing.
One descriptor at level 0 points to the stack vector, which
contains descriptors of all the stacks in the system including the
"pseudo-stacks" of levels 1 and 0.

Each processor contained a set of hardware registers
which indicated the actual M2 addresses of the start of each
of the system levels, i.e., the base address of the
corresponding stack. Figure 2 also shows the linkages that tie the
Compool mechanism into the system.
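
The addressing by stack number and offset described above can be illustrated with a minimal Python sketch. It is not taken from reference [1]: the `Descriptor` fields, the `resolve` function, the bounds check, and the particular base addresses and lengths are all hypothetical, chosen only to show how a stack vector of descriptors turns a (stack number, offset) pair into an M2 address.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    base: int      # assumed: M2 address of the first word of the stack
    length: int    # assumed: number of words in the stack

def resolve(stack_vector, stack_number, offset):
    """Turn (stack number, offset from base) into an M2 address."""
    d = stack_vector[stack_number]
    if not 0 <= offset < d.length:
        # The bounds check is an assumption of this sketch.
        raise IndexError("offset outside the stack")
    return d.base + offset

# Hypothetical stack vector: one descriptor per stack in the system,
# including the "pseudo-stacks" of levels 0 and 1.
stack_vector = [
    Descriptor(base=0, length=16),     # stack 0: Level 0 pseudo-stack
    Descriptor(base=16, length=32),    # stack 1: a Level 1 descriptor vector
    Descriptor(base=48, length=64),    # stack 2: a process stack
]

address = resolve(stack_vector, 2, 5)  # word 5 of the process stack
```

In the design as described, the base addresses of the active levels would be held in hardware registers in each processor; the list lookup here simply stands in for that mechanism.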

The operating system design philosophy reflected an
emphasis on the achievement of reliable operation of both
hardware and software. It was assumed that only higher order
language(s) would be used in the programming of application
software. The exclusive use of HOLs allows secure system
operation to be realized without exhaustive runtime verification
of each request for OS functions. An intimate and well-defined
interface between OS and the compiler(s) was assumed achievable,
so that an optimal division between static (pre-run) and
dynamic (runtime) diagnosis could be made.

It was assumed that the language/compiler to be used
in programming the Space Station application software would
possess the facility of handling common data pools (Compools).
The MP design provided a Compool implementation.
Reference for Chapter 1


1. Miller, J.S., et al., "Engineering Study for the
Functional Design of a Multiprocessor Design",
Intermetrics/NASA Contract NAS9-11745, September,
1972.
Chapter 2

MULTIPROCESSOR OPERATING SYSTEM DESIGN


2.1 Introduction


This chapter will discuss the special problems facing
the designer of an operating system for a multiprocessor
computer. The scope of the task which is summarized here did not
encompass all aspects of OS design. Emphasis is placed on the
more important functions and on those aspects of OS which are
unique to, or at least more significant for, multiprocessors
as compared with simplex computers.


An operating system for a space station multiprocessor 
will be capable of supporting a wide variety of functions. Al- 
though some of these may be unique to the application, it is 
very probable that the following standard functions will always 
be required in some measure: 


a) Initialization 

This deals with the initial introduction of
information into the computing system and its preparation for
eventual execution. It includes bootstrapping from a
cold start, establishing the minimum state from which
the complete system structure can be created, the
problems associated with loading and linking of programs
and data for execution, etc. This topic is not a
trivial one: a real-time MP/OS is a complex structure and
the problem of establishing it as a working entity
from scratch should be considered at the time its
initial design is undertaken. Initialization will not be
discussed further.

b) Process State Controller

The basic element of computational work will be termed
a Process. Processes can exist in various states:
execution, readiness, stall or suspension. This function
of the OS controls the orderly progression of processes
between these states in response to various stimuli,
such as voluntary process state changes, I/O interrupts,
priority changes, interprocess communication, etc.
c) Interrupt Servicing

A real-time, general purpose, central computer for a
space station will almost certainly be required to
handle system-originated external interrupts in
addition to interruptions due to arithmetic traps and
other error conditions. This OS function implements
the desired responses to randomly occurring events of
this nature.

d) Timing and Synchronization

This function provides the basic mechanism for
controlling the time dependent execution, and the
synchronization, of parallel, concurrent processes in a real-time,
multiprogrammed environment.

e) Resource Management

This is the basic function of an operating system. The
resources required by a computational process are
various. First, there are the basic hardware elements:
the processors, memory modules, and interconnecting data
paths which must be available to allow the process to
run. Then there are the less tangible items such as
common programs and data over which conflict of access
by several concurrent processes is possible. Lastly,
there is external device availability: sensors,
avionics data buses, disks, tapes, etc. The resource
management function is usually divided into processor
allocation, memory allocation, compool and shared data
management, I/O and file management. It is the function
of resource allocation to ensure that each scheduled
process is granted a sufficient share of the available
resources to execute in a timely fashion without adverse
effect on other processes.

f) Configuration Control 

In a fault tolerant computer, the current status and 
the configuration of all elements of the computer must 
be continuously monitored and controlled by the opera- 
ting system. 

g) Operator and User Interfaces

The OS must provide facilities to interface with the
operator and/or user. For a complex system this is not
a simple task, especially when a major mode of operation
is interactive usage by the crew members in controlling
the progress of a mission.
h) Performance Monitoring

This is an often under-emphasized function of an
operating system, but it is an especially important one in a
new or novel application such as a space station MP.
The more sophisticated a system is, the greater is the
need to measure, evaluate and influence its performance.

Some of these functions will be reviewed again in the 
light of the following discussion of problems facing the multi- 
processing operating system designer. 


2.2 Problems of Multiprocessing

The multiprocessing environment does not pose any
difficulties that the designer of an operating system for a
multiprogrammed, single processor system has not also had to face and
overcome. The MP adds new facets to familiar problems, however,
by reason of the concurrent, rather than sequential, execution
of the multiple processes within the system. This requires that
greater care be taken to prevent damaging interaction between
processes at a point of commonality, especially with regard to
shared data. Measures taken to protect processes against each
other usually affect performance unfavorably. The maintenance
of performance near the theoretical limit is, in any case, more
difficult for a multiprocessor than for an equivalent simplex
computer.


An attractive feature of the multiprocessor is the
prospect of increased performance achieved by means other than
advances in processor technology, i.e., n similar processors doing
the work of one n times as fast. In practice several factors
prevent this promise from being fulfilled. If we define
"throughput" as the integral over time of the rate of "useful
computation" C, then it can be shown that:

$$\int C \, dt \;>\; \int n \left( \frac{C}{n} \right) dt$$

where n is the number of processors. C is a discontinuous
function of time, and as n increases, it becomes increasingly
difficult for C to remain non-zero for long periods of time.
Computation lost whenever C falls to zero may not be made up in time,
and the right hand (multiprocessor) integral continuously loses
ground to the left hand (simplex computer) integral. The reasons
for this are enumerated below.
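
As a rough numerical illustration of the throughput comparison above, the following Python sketch integrates a piecewise-constant rate of useful computation for a simplex computer of rate C and for n processors each of rate C/n. The function names, the time step, and the schedule of busy processors are illustrative assumptions, not values from the study.

```python
def throughput(rates, dt):
    """Integrate a piecewise-constant rate of useful computation over time."""
    return sum(r * dt for r in rates)

def multiprocessor_rates(busy, c, n):
    """Aggregate rate per time step when each of n processors runs at c/n.

    busy[t] is the number of processors doing useful work at step t
    (a hypothetical schedule; the others are waiting or idle).
    """
    return [b * (c / n) for b in busy]

C = 1.0    # rate of the equivalent simplex computer (arbitrary units)
N = 3      # number of processors, each of rate C/N
DT = 1.0   # length of one time step

# Hypothetical schedule: number of usefully busy processors per step.
busy = [3, 3, 2, 3, 1, 3, 2, 3]

simplex = throughput([C] * len(busy), DT)
multi = throughput(multiprocessor_rates(busy, C, N), DT)
```

Under these assumptions the multiprocessor matches the simplex machine only in the steps where all three processors are busy; every step in which a processor waits lowers the right-hand integral, and that loss is never recovered.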
2.2.1 Parallelism


In order for all n processors to be kept usefully at 
work, their load must be capable of being organized into n or 
more tasks which can be executed in parallel, continuously and 
simultaneously. The degree to which this can be done depends 
on the parallelism inherent in the work load. Certain types 
of computation exhibit natural parallelism, e.g., signal pro- 
cessing, where the same operation is applied to multiple sets 
of input data (promoting the design of so-called Single Instruc- 
tion Multiple Data (SIMD) computers, for example the Goodyear 
Associative Processor [1]). But, in general, parallelism must 
be sought out, identified and utilized. It exists potentially 
on several levels: 

a) On the "job" level. In a general purpose computer fac- 
ility, the submitted jobs are normally completely inde- 
pendent of one another, even if they share resources. 

b) Within a job, at the task level. 

c) Within a task, most of the statements are independent 
of one another. 

d) Within a single statement some computations can be done 
in parallel. 

Parallelism of types c) and d) is not visible to the
operating system, because the basic unit of OS is the process
(or task). For the type of application being considered for the
SUMC multiprocessor, it is not likely that the work load will
totally resemble that of a ground based general purpose facility,
although it will exhibit more of its aspects than will a simple
flight control computer. Parallelism of type a) will probably
not be present in sufficient proportion to provide the sole
guarantee of full employment for two or more processors. It becomes
necessary to deal, additionally, with parallelism at the task
level. The trouble is that problem solving with a computer is,
in general, a serial process: programmers do not naturally think
in terms of concurrent parallel processes in arriving at their
solutions, unless such a structure is inherent in the problem.
A real time control function may consist of several, more or less
independent, activities going on in parallel, e.g., system
monitoring, navigation, display processing, and vehicle control.
Even so, it is anticipated that there will not be sufficient
functions of this type to keep two or more processors fully
occupied all the time.

It is necessary, therefore, to uncover task parallelism
that may not be apparent, and even to create parallelism if none
exists. This imposes a constraint on the programmer, which must
be considered deleterious because it is not natural. So it is
necessary to assist the programmer with a programming language
and a compatible operating system that contain features,
attractive to use, that encourage the creation of multiple, independent
processes. The use of a block-structured language encourages
programs to be written as collections of small, closed subroutines.
ALGOL, PL/I and HAL are among the languages that possess this
property. In addition to structure, a language can provide a
convenient and natural way to interface with the executive by
recognizing tasks as syntactical entities. The multi-tasking
features of PL/I and HAL encourage the programmer to think, as
he programs, in terms of processes which are amenable to scheduling.


The multiprocessor operating system must support the 
requirements of parallel tasking by providing adequate communi- 
cation and synchronization primitives, and by protecting shared 
data against conflicting concurrent accesses. These requirements 
are discussed in more detail later. 


2.2.2 Exclusive Sections 

In a general purpose multiprocessor certain operations
are concerned with the manipulation of unique system data such
as, for example, information maintained by the Process State
Controller, which contains the current dynamic state of all
processes. Execution of the Process State Controller is an exclusive
operation: only one process may perform it at a time. In a
simplex computer this is achieved trivially: it is only
necessary to inhibit interruption of the single processor by external
happenings to assure exclusive execution of the Process State
Controller. A multiprocessor requires a more elaborate mechanism
to prevent the simultaneous execution of such critical functions
by two or more processors. Such mechanisms cause the conflicting
processes to become serialized in time, each being admitted to
the critical section through interlocking turnstiles (a
generalized mechanism is described later). The net effect is that
whenever two or more processes wish to enter an exclusive section,
only one may do so and continue executing: the other(s) must
wait. If the exclusive section is designed to inhibit the
alternate assignment of the processor (e.g., if it is the Process
State Controller), then throughput temporarily falls until the
other processor is through with the exclusive section. This loss
of throughput cannot be made up again. Note that in a batch
environment conflicts of this type are rare, but in a real time
system of short tasks, with frequent process state changes, the
probability of conflict may become significant. This
precipitates the following quandary: to encourage parallelism a
multiprocessor program should consist of many concurrent tasks, but to
avoid critical section conflict it should be organized into as
large a serially-executable piece as possible!
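
The serializing effect of an exclusive section can be sketched in Python with a mutual-exclusion lock standing in for the turnstile. This is only an illustration under stated assumptions: the class `ProcessStateController`, its `set_state` method, and the use of `threading.Lock` are hypothetical stand-ins, not the generalized mechanism described later in this report.

```python
import threading

class ProcessStateController:
    """Hypothetical stand-in for the exclusive Process State Controller."""

    def __init__(self):
        self._turnstile = threading.Lock()   # admits one caller at a time
        self.states = {}                     # process name -> current state
        self.transitions = 0

    def set_state(self, name, state):
        # Only one thread may be inside this section; any other must wait.
        with self._turnstile:
            self.states[name] = state
            self.transitions += 1

def worker(psc, name, count):
    """Simulate a processor driving frequent process state changes."""
    for _ in range(count):
        psc.set_state(name, "ready")
        psc.set_state(name, "executing")

psc = ProcessStateController()
threads = [threading.Thread(target=worker, args=(psc, f"P{i}", 1000))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every call to `set_state` is admitted singly, so the shared table stays consistent, but any thread that arrives while another holds the lock does no useful work until it is released. That waiting time is the throughput loss discussed above.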


2.2.3 Shared Data 

There is a problem with shared data, aside from the
need to protect it from simultaneous modification. It is
associated with the creation of copies of shared data. In many
computer designs, performance improvements have been achieved by
localizing lengthy sequences of operations within the fast logic
of the processor, rather than executing out of main memory. (The
cache memories of the IBM 370 series [2] and the task memory of
the Navy's AADC [All Applications Digital Computer] [3] are
examples of localized processing.) The problem arises because data
is maintained local to the processor. If the data is shared with
other processes, changes in the original or any of the copies
must be reflected in all. Some means must be found either

a) to allow one process access to another's local storage,

b) to update all copies of shared data at the same time, or

c) to prevent old values from being used by other
processes until updating is performed.

It should be pointed out that this phenomenon is
encountered whenever copies of shared data are created in any
system: in the Burroughs B6700 series the problem arises through
its use of descriptors. These are maintained in the stacks of
individual processes. Whenever a descriptor needs to be changed
(it is a common occurrence in a virtual memory system for a
descriptored item to be transferred to back-up storage: the
address field of its descriptor must be modified to reflect this
change of whereabouts), all processors in the B6700 are stopped,
and all process stacks in main memory are searched for copies
of the particular descriptor. The B6700 was not designed as a
real time controller, so the ensuing loss of processing time was
not considered objectionable by the designers. It is a different
matter for a space station computer, however. The Multiprocessor
design developed by Intermetrics [4] employs a unique approach
to a similar problem. The copy of a descriptor may be maintained
in an associative memory local to a processor. This avoids
accessing the descriptor through three levels of indirection,
each involving main memory references. Changes in the descriptor
are very quickly signalled by the provision of a specific machine
instruction which cancels the appropriate entry in the
associative memory. The Intermetrics multiprocessor avoids local copies
of the data itself, and thereby foregoes the potential
performance advantages of local buffer or cache-type processing.
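
The cancel-on-change approach to locally held descriptor copies can be sketched as follows. The sketch is hypothetical throughout: the class and method names, the single counted main-memory read standing in for the three levels of indirection, and the choice to cancel the entry in every processor are assumptions made for illustration, not details taken from reference [4].

```python
class MainMemory:
    """Hypothetical main memory holding the master copy of each descriptor."""

    def __init__(self):
        self.descriptors = {}
        self.reads = 0    # counts descriptor fetches from main memory

    def read_descriptor(self, name):
        self.reads += 1
        return self.descriptors[name]

class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.assoc = {}   # local associative memory: name -> descriptor copy

    def descriptor(self, name):
        # Fetch from main memory only when no local copy is held.
        if name not in self.assoc:
            self.assoc[name] = self.memory.read_descriptor(name)
        return self.assoc[name]

    def cancel(self, name):
        """Model of the cancel instruction: drop the local copy, if any."""
        self.assoc.pop(name, None)

def update_descriptor(memory, processors, name, value):
    """Change the master copy, then cancel every local copy (assumed policy)."""
    memory.descriptors[name] = value
    for p in processors:
        p.cancel(name)

memory = MainMemory()
memory.descriptors["seg"] = ("M2", 0x100)
p1, p2 = Processor(memory), Processor(memory)
first = (p1.descriptor("seg"), p2.descriptor("seg"))   # two main-memory reads
again = p1.descriptor("seg")                           # served from local copy
update_descriptor(memory, [p1, p2], "seg", ("M3", 7))  # segment moved to backup
after = p1.descriptor("seg")                           # re-read from main memory
```

No processor is stopped and no stack is searched: a stale copy is simply discarded, and the next reference pays for one fresh fetch.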
2.2.4 Conflict Over System Resources


The most critical resource is main memory. As the num- 
ber of processors increases, the possibility of conflict between 
them over the use of memory increases. As in the case of shared 
data, a resolution of conflict results in one or more processors 
losing processing time, and the right hand integral of the ex- 
pression for throughput given earlier again loses to the left 
hand. The device of interleaving the modules of a memory system 
can be used to minimize the delays incurred by conflict, but it 
exacts a cost in added hardware complexity. Its effect is to 
randomize memory usage and thus to obtain stationary behavior. 
Another approach is to partition memory amoung the various pro- 
cesses so that processors tend to execute out of physically 
ocpaiuL-w modules . i.c., mate the memory us <>.q n very determinis- 

tic. This technique implies a sophistication of the operating 
system, a well-known job stream, and a memory system of suffi- 
cient modularity. 
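
The two approaches can be contrasted with a small Python sketch. It assumes low-order interleaving (module number taken from the address modulo the number of modules), which is only one possible interleaving scheme and is not specified in the text; the function names and the 4-module, 4-word layout are illustrative.

```python
def interleaved_module(address, n_modules):
    """Low-order interleaving: consecutive addresses fall in consecutive modules."""
    return address % n_modules


def partitioned_module(address, module_size):
    """Partitioned layout: each module holds one contiguous block of addresses."""
    return address // module_size


# Two processors, each sweeping a private 4-word region of a 4-module memory.
regions = {"P0": range(0, 4), "P1": range(4, 8)}
for name, addresses in regions.items():
    print(name,
          "interleaved:", [interleaved_module(a, 4) for a in addresses],
          "partitioned:", [partitioned_module(a, 4) for a in addresses])
```

Under interleaving both processors visit every module, so conflicts are spread evenly; under partitioning each processor stays within its own module, provided the operating system can arrange the work load that way.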

The network interconnecting processors, memories and 
I/O units is a more critical element in a multiprocessor than 
in a simplex system. With more than one processor requesting 
memory at a time, this bus itself becomes a source of conflict. 
It would seem that a technique that lowers the frequency of use 
of the bus would lessen the probability of such conflict. For 
example, the use of a cache memory, by encouraging local execu- 
tion, would appear to make bus use less frequent. However, 
analysis shows that the probability of bus conflict actually 
increases with increasing speed of the cache, thereby defeating 
any performance advantage. 

In summary, techniques devised to minimize conflict in 
a multiprocessor are susceptible to the following drawbacks, any 
or all of which combine to prevent the multiprocessor throughput 
from equalling that of the equivalent simplex processor: 

a) Increased hardware complexity and cost, 

b) Increasing operating system sophistication, usually ac- 
companied by increased overhead in space and time. 

c) Reduced throughput due to delays introduced to resolve
conflict.

The more processors in the system, the more marked is 
this effect. Only in a particular application, for which the 
characteristics of the work load can be anticipated, is it pos- 
sible to deduce the number of processors required to achieve 
a given performance cost effectively. In the absence of such 
information about the environment of the multiprocessor, this 
limit is very difficult to determine. As a result, almost all
practical designs of multiprocessors to date have been limited 
to the degenerate case of two processors. Some designs have 
even dedicated functions or resources to each processor in order 
to avoid some of the above problems, resulting in configurations 
of dual computers rather than dual processors. 


2.2.5 Overhead 

The preceding sections have cited several factors that
contribute to the complexity of functions that a multiprocessor
operating system is required to perform. Each factor contributes
to the overhead of computational time and memory space consumed
by the operating system. Matters are further aggravated
because the many activities going on simultaneously in a
multiprocessing environment take on the characteristics of a
queueing problem: their deleterious effects are in general worse
than additive, i.e., the loss in real throughput is a non-linear
function of the number of contributing overhead mechanisms.

But to end this section on a positive note, it should
be realized that this depressing parade of multiprocessing
difficulties has a corollary: small efforts to limit the damaging
effects of each of the mechanisms discussed in this section can
yield dramatic improvements in throughput because of the
exponential nature of their interaction.


2.3 Exclusion and Synchronization 

Any multiprogrammed system requires operating system 
primitives for the communication and mutual protection of the 
concurrent processes. In a multiprocessor, these activities 
can be actually time-concurrent and these primitives must be 
implemented in a combination of hardware and software. The 
problem of protection against unwanted interactions will be 
reviewed first, followed by a discussion of synchronization. 


2.3.1 Exclusion Primitives 

In a simplex computer a basic exclusive operation may 
be implemented in software, but a multiprocessor needs hardware 
assistance for such an operation, because of the true time- 
concurrency of execution of two or more processes. The hardware 
must be capable of reading the value of a variable, and then 
rewriting the variable with a new value in one uninterruptible 
operation. An example of such an instruction is the TS (Test 
and Set) of the IBM 360 series, which writes all ones into a 
specified byte and sets a condition code with the original 



INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840 



contents. The Burroughs B6700 RDLK (Read with Lock) instruction, 
which stores the contents of the B register into the location 
whose address is contained in the A register, but leaves the 
previous contents of the location in the B register, is closer to 
a generalized non-divisible read and write operation. 

The actions of a set of general operating system pro-
cedures designed to provide the exclusion primitive are as fol-
lows:

ENTER                 Check for occupancy of procedure. Set lock.
                      If locked, enter wait queue.

(critical section)

EXIT                  Check for occupancy of procedure. Remove self
                      from wait queue. Inform executive to wake next
                      in queue, if any.


How these actions are implemented using a fictitious 
non-divisible read and write instruction NDRW is illustrated 
in Figure 1. Let the execution of NDRW exchange the contents 
of the operand, MUEX, with the contents of the accumulator. 
MUEX may contain the following values: 

0               No process executing critical section
                (i.e., section is "free")

1               Critical section is being executed by
                one process

2, 3, ..., n    Critical section is being executed by one
                process, and 1, 2, ..., n-1 are waiting to
                gain access. Requires a MUEX queue structure
                to be maintained by OS.

negative        Procedures ENTER or EXIT are being executed
                by a process. (The OS primitive itself
                must be protected against multiple use.)


The actions surrounded by dotted lines indicate the
execution of the Process State Controller function. Note that
the final updating of MUEX in cases where a process is to be
placed in the wait state, or readied to execute the critical sec-
tion, must be done within the Process State Controller to prevent
interruption of the sequence.
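
The MUEX protocol can be sketched in Python as follows. This is an illustrative model under stated assumptions, not a transcription of Figure 1: the indivisibility of NDRW is simulated with a threading.Lock, the wait queue and the wake-up normally handled by the Process State Controller are collapsed into a deque of threading.Event objects, and the names Muex, ndrw, enter, and exit are hypothetical.

```python
import threading
import time
from collections import deque


class Muex:
    """Sketch of ENTER/EXIT built on a simulated indivisible exchange (NDRW)."""

    BUSY = -1  # negative value: ENTER or EXIT is itself being executed

    def __init__(self):
        self.value = 0                 # 0 = free, 1 = occupied, k > 1 = k-1 waiting
        self._hw = threading.Lock()    # stands in for hardware indivisibility
        self._queue = deque()          # MUEX wait queue kept by the "OS"

    def ndrw(self, acc):
        """Exchange the accumulator with MUEX in one uninterruptible step."""
        with self._hw:
            old, self.value = self.value, acc
        return old

    def _seize(self):
        """Spin until the primitive itself is obtained; return the old MUEX value."""
        while True:
            old = self.ndrw(self.BUSY)
            if old >= 0:
                return old
            time.sleep(0)              # yield so the current holder can finish

    def enter(self):
        old = self._seize()
        if old == 0:
            self.ndrw(1)               # section was free: occupy it
            return
        wake = threading.Event()
        self._queue.append(wake)
        self.ndrw(old + 1)             # one more process is now waiting
        wake.wait()                    # Wait state until EXIT hands the section over

    def exit(self):
        old = self._seize()
        if old > 1:
            waiter = self._queue.popleft()
            self.ndrw(old - 1)         # section passes directly to the next process
            waiter.set()
        else:
            self.ndrw(0)               # nobody waiting: section becomes free
```

A process brackets its critical section with enter() and exit(). While MUEX holds the negative value, any other process attempting ENTER or EXIT simply retries, which is how the primitive protects itself against multiple use.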

This exclusion mechanism must be expanded if it is
required to accommodate the comprehensive Update Block capability,
for controlling the accessing of common data, provided in the
HAL language [5]. It is not always necessary to prevent all types
of access to shared variables: a shared variable can be read,
as long as it is not actually being changed. The ability to
differentiate between types of access reduces the time for which
a requesting process must be made to wait, with consequent
improvement in throughput. The HAL Update Block is in effect a
modified form of critical section. Every variable that is
addressed within an Update Block has associated with it a "lock
type" attribute. The lock can assume the following states:


a) Free:    Unlocked
b) Read:    Accessed for reading only
c) Copy:    Accessed for modification
d) Write:   Being modified


A variable that is to be modified is first copied, and
all intermediate computations are performed on the copy. This
is the meaning of the "Copy" state. Final values are written
from the copy to the actual variable after the state of the
lock has been raised to "Write". The testing and setting of
the states of locked variables requires the use of the NDRW
instruction. A requesting process is allowed into the Update
Block only if the type of access requested is compatible with
the current state of all locks within the block. For example,
a request to read the variables is allowed if the current state
of all locks is "Free", "Read", or "Copy", but is not allowed
if any are in "Write".
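
The read-access rule just stated can be written down directly. The Python sketch below encodes only that one rule; the compatibility rules for copy and write requests are not given in the text and are deliberately left out. The names LockState and read_request_allowed are hypothetical.

```python
from enum import Enum


class LockState(Enum):
    FREE = "Free"
    READ = "Read"
    COPY = "Copy"
    WRITE = "Write"


def read_request_allowed(lock_states):
    """A read is admitted unless any lock in the block is in the Write state."""
    return all(state is not LockState.WRITE for state in lock_states)


print(read_request_allowed([LockState.FREE, LockState.READ, LockState.COPY]))
print(read_request_allowed([LockState.READ, LockState.WRITE]))
```

Because readers are admitted while another process is still working on its copy, a reader is held up only for the short interval in which final values are actually being written.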
An operating system mechanism to implement Update Blocks
involves the maintenance of linked queues (see Figure 2). Every
locked variable has associated with it a queue of requesting 
processes, each identified with its individual access type. All 
queue elements associated with a given process are also linked, 
to facilitate the response to changes of state of the processes. 


The Intermetrics design of a multiprocessing operating
system [4] defined a pair of generalized primitives, ACQUIRE
and RELEASE, of the form:

     ACQUIRE (Mode, Category, Name, Access)

where each of the terms has the following meaning:

a) Mode:       The calling process is placed in the Wait
               state if access is not immediately possible,
               or an immediate return may be specified with
               an indication of why access could not be
               allowed.

b) Category:   Data, code or device. The ACQUIRE primitive
               is applicable to the protection of shared
               data, the implementation of exclusive
               sections, or the use of a shared device such
               as a printer.

c) Name:       Identifies the item in the category, e.g.,
               the name(s) of the specified shared variables.

d) Access:     Shared, update or exclusive access request.
               These are analogous to HAL's Read, Copy,
               and Write lock type states.


It is possible to define any type of required exclusive 
operation in a given system with these two primitives. 


2.3.2 Synchronization 

In order to provide for communication between parallel 
processes of a multi-tasked environment it is convenient to in- 
voke the concept of an "event". An event is a variable whose 
state reflects the occurrence of an activity within the system, 
e.g., the completion of a lengthy computation or the arrival in 
memory of a previously requested item of I/O. The process await- 
ing the activity is associated with the event. The "signalling" 
of the event results in the process being made ready to continue. 
For illustration, let Tasks A, B and C be three independent tasks, 
all scheduled during the execution of some master Program. Sup- 
pose it is appropriate to schedule Task C only when certain com- 
putations have been completed by Tasks A and B. Tasks A and B 
may be executing on separate processors, and thus be unaware of
one another, in which case they cannot easily cooperate in
the scheduling of Task C. However, if each were to signal
an event on completion, e.g., EVENT_A and EVENT_B respectively,
then the event mechanism can provide the synchronization that
causes Task C to be scheduled as soon as both EVENT_A and EVENT_B
have been signalled.

Figure 2: Update Block Structure

The language multi-tasking features that were advocated 
earlier to help keep a multiprocessor busy are supported in PL/I, 
ALGOL and HAL by event mechanisms of varying sophistication. 

The Intermetrics multiprocessor design [4] specified a very
comprehensive event structure which enabled complex logical
expressions to be evaluated as event signals. In this design events
are controlled by primitives of the form

     SET   (E, n, E1, E2, ..., Em)
     RESET (E, n, E1, E2, ..., Em)

which is interpreted as "set (reset) event E when n of the events
in the list E1 through Em are signalled." If n = m, this
expression is the boolean "and" of all listed events, and if n = 1
it is the "or". The primitives also have a simpler form

     SET   (E)
     RESET (E)

Response to the signalling of events is basically of two forms:

     WAIT (n, E1, E2, ..., Em)

and

     ON (n, E1, E2, ..., Em) <code>


In the first, as the WAIT is executed the process is placed in 
the Wait state until the event expression becomes true. The 
second statement causes an interruption of the process as soon 
as the expression becomes true, to execute the procedure spe- 
cified in the "code". 
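
The "n of m" event expression can be illustrated with a short Python sketch. It is a single-threaded model in which signalling an event immediately re-evaluates the expressions that depend on it; the names Event and set_when are hypothetical, and "n of the events" is taken to mean "at least n", an assumption that is consistent with the "or" interpretation for n = 1.

```python
class Event:
    """Hypothetical event variable; set() signals it and re-evaluates dependents."""

    def __init__(self, name):
        self.name = name
        self.signalled = False
        self._watchers = []

    def set(self):
        self.signalled = True
        for watcher in list(self._watchers):
            watcher()

    def reset(self):
        self.signalled = False


def set_when(target, n, events):
    """Model of SET (E, n, E1, ..., Em): set target once n listed events are signalled."""
    def check():
        if not target.signalled and sum(e.signalled for e in events) >= n:
            target.set()
    for event in events:
        event._watchers.append(check)
    check()


# Task C becomes ready only when both Task A and Task B have signalled (n = m = 2).
event_a, event_b, ready_c = Event("EVENT_A"), Event("EVENT_B"), Event("READY_C")
set_when(ready_c, 2, [event_a, event_b])
event_a.set()
print(ready_c.signalled)   # Task B has not yet signalled
event_b.set()
print(ready_c.signalled)
```

A WAIT on the same expression would place the calling process in the Wait state until the check succeeds, and an ON would run the attached code at that moment; both are omitted here to keep the sketch short.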
The implementation of an event structure involves mul-
tiply-linked queues of event elements which allow the associa-
tion between the processes involved in declaring, signalling
and responding to events to be established, executed, and re-
moved in a dynamic fashion. It is perhaps superfluous to point
out that such a mechanism in a multiprocessor environment re-
quires processors to be able to interrupt one another. This
requirement is met, for example, in the Burroughs B6700 by [...]
and in the RCA 215 by the "INTERRUPT CPU" instructions.


2.4 Schedule


The scheduling function of the operating system ensures
that processes are [...] of Process Control and Resource Allocation
defined earlier. This section will discuss briefly the following
aspects of this function:

a) Ensuring that computation time and space are properly
apportioned among the processes according to predetermined
needs, while maintaining an appropriate balance between the
conflicting requirements of throughput, efficiency and
response. Throughput is defined as the amount of useful work
accomplished by the total multiprocessor system, efficiency is
the degree of utilization of the basic components of the system
(e.g., processors, memory modules, I/O devices) and response is
the ability to react to a given stimulus.

b) Ensuring that competition between processes in their
demands for resources does not produce catastrophic conditions,
such as deadlock or thrashing.

c) Preventing the resulting computational overhead, especially
of time in a real-time control system, but also of space, from
becoming excessive (the definition of "excessive" is not
attempted here!).

2.4.1 Space and Time Allocation

The computational activities in a space station multi-
processor are expected to fall into the following categories:

Category      Characteristic        Criticality    Examples
              Response Time Range

Batch         10 secs - mins        non-critical   Lengthy computations. Off line
                                                   experiment data processing.

Interactive   0.1 sec - 10 secs     non-critical   Crew operational sequences.
                                                   Time sharing by scientific
                                                   personnel.

Real Time     1 ms - 100 ms         non-critical   Control of scientific experiments.
                                                   Operational equipment status
                                                   monitoring.

Real Time     1 ms - 100 ms         critical       Operational equipment servicing:
                                                   strapdown IMU. Closed loop
                                                   control: autopilots, etc.


Processing tasks in the batch category can, to an extent, 
ignore the constraint of time. The allocation of memory space or 
other system resources such as common data, input file, I/O de-
vices, processors, can be considered with more freedom. The
presence of this category in the total work load can provide a 
measure of global optimization in the use of system resources to 
maximize efficiency . 

The time-critical real-time tasks can not make such 
compromises. Resources must be ready when needed. The need 
is often (but not always) randomly determined. Unless it is 
composed of highly repetitive tasks, the real-time component 
of the work load prevents high values of throughput and effi- 
ciency from being attained. 

A work load consisting of components from each category 
must be so arranged and presented to the computer system that 
all tasks can get sufficient cuts at the system's processing re- 
sources. Obviously, no amount of intelligence built into an 
operating system will supply enough computational resources to 
a work load whose demands exceed the capability of the machine. 

An operating system can be designed to contain features and to 
operate in a way that matches the characteristics of the work 
load. But it remains the responsibility of the user of the sys- 
tem to assign a given work load to the machine in such a way 
that it does not overload the system. 

Task scheduling can be approached from two extremes: 

a) Synchronous, or time slot scheduling. Each task is
allotted a different, but fixed, interval of time for
execution, which is available at multiples of fixed
minor cycle intervals.

b) Demand scheduling. Tasks are allocated processors
and other resources on demand, at execution time, according
to the needs and importance of the task and the
availability of the resources. Tasks are differentiated
in importance by a priority value which stays as
initially assigned, or changes as a function of time or
the tasks' status.

The advantages of the synchronous approach are: 

a) Minimal overhead, since scheduling is pre-determined;

b) The scheduler is simpler, being essentially table driven;

c) The fixed schedule of task execution eliminates problems
associated with code and data sharing, and does not re-
quire re-entrant code;

d) The load may be evenly distributed over the available
time;

e) The deterministic behavior makes system verification
easier.


The difficulties associated with it are: 

a) It is difficult to structure programs so that they may
be time-sliced;

b) Each time slice must be sufficient to accommodate the
worst case, so on the average will be under-utilized;

c) It is difficult to accommodate response to random events
such as crew inputs. Response to system failures is
especially difficult, unless recovery from all classes
of failures is pre-scheduled;

d) The structure is inflexible to change.

These disadvantages are all overcome by the demand scheduling 
approach, which, however, suffers from an increased degree of 
difficulty because of its greater complexity, and because it 
is more difficult to verify. 
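
The two extremes can be contrasted in a few lines of Python. This is a sketch only: the task names, periods, and priorities are invented for illustration, the synchronous table is expressed as (period, phase) pairs in minor cycles, and the demand scheduler assumes that a lower number means a more important task.

```python
import heapq


def synchronous_schedule(table, n_cycles):
    """Table-driven schedule: table maps task name -> (period, phase) in minor cycles."""
    timeline = []
    for cycle in range(n_cycles):
        due = [name for name, (period, phase) in table.items()
               if cycle % period == phase]
        timeline.append(due)
    return timeline


def demand_schedule(requests):
    """Dispatch (priority, name) requests in priority order (lower value runs first)."""
    heap = list(requests)
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order


table = {"autopilot": (1, 0), "imu": (2, 0), "status": (4, 1)}
for cycle, due in enumerate(synchronous_schedule(table, 4)):
    print("minor cycle", cycle, due)

print(demand_schedule([(3, "batch"), (1, "alarm"), (2, "crew")]))
```

The synchronous timeline is fixed before execution and can be checked by inspection, which is the verification advantage noted above; the demand scheduler adapts to whatever requests arrive, at the cost of a run-time decision on every dispatch.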

In a functional design of an executive for the Space
Shuttle central computer, Intermetrics has proposed a combined
synchronous and demand scheduled approach [6]. The repetitive,
time-critical functions which can be implemented in short,
complete sections of code are executed by a synchronous "fore-
ground" scheduler driven by timer interrupt, at 40 ms intervals.
The majority of the remaining tasks are scheduled on demand as 
a "background" activity according to pre-assigned priority 
values. Communication between foreground and background is by 
an event mechanism, in essence similar to that described in sec- 
tion 2.3.2. 


2.4.2 Deadlock Prevention 

OS/360 has three resources to allocate to each job/step. 
These are core storage, data sets and peripheral devices. The 
allocation algorithm is summarized in Figure 3. Note that all 
data sets for the entire job are allocated at job initialization 
time and are bound for the duration of the job. In addition, 
all devices are allocated at step initialization time and are 
bound for the duration of the step. This approach may be costly 
since some of the resources allocated to a task may remain un- 
used for long periods. 

Alternatively, resources may be allocated dynamically, 
i.e., while the process is running. Unfortunately, now dead- 
lock prevention becomes a more difficult problem. However, 
some practical solutions have been suggested [7], although a 
time overhead must be paid if they are implemented. 

The suggested methods involve keeping track of the state
of the system by means of state graphs or matrices. When a re-
source is requested by an executing process, the availability
of the resource is checked. If it is presently unavailable, the
algorithm must determine if it is safe to put the requesting
task in the wait state. To determine this, it checks the state
matrices of the system as they would be if the request were
enqueued for the resource. When a safe condition results, the
request is enqueued, and the task is placed in the wait state.
On the other hand, if an unsafe condition results, the request
must be denied and the task so notified. The task can then de-
cide if it wishes to cease execution or if it can proceed with-
out the resource. (Some subtle problems to be aware of, in
implementing such an algorithm, have been overlooked by several
authors and are discussed by Holt [8].)
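
The safety check can be illustrated with a deliberately simplified Python sketch. It assumes single-unit resources and tasks that wait for at most one resource at a time, so that "unsafe" reduces to "enqueueing the request would close a cycle of waiting tasks"; the methods cited above handle more general cases. The function name and the two dictionaries are hypothetical stand-ins for the state graph.

```python
def would_deadlock(holder, waiting_for, task, resource):
    """True if making `task` wait for `resource` closes a cycle of waiting tasks.

    holder:      resource -> task currently holding it
    waiting_for: task -> resource that task is already waiting for
    """
    seen = set()
    current = holder.get(resource)
    while current is not None and current not in seen:
        if current == task:
            return True                      # the chain of waits leads back to the requester
        seen.add(current)
        blocked_on = waiting_for.get(current)
        current = holder.get(blocked_on) if blocked_on is not None else None
    return False


holder = {"tape": "A", "printer": "B"}
waiting_for = {"A": "printer"}               # A holds the tape and waits for the printer

print(would_deadlock(holder, waiting_for, "B", "tape"))   # B holds the printer
print(would_deadlock(holder, waiting_for, "C", "tape"))   # C holds nothing
```

When the check reports an unsafe condition the request is denied and the task notified, as described above; otherwise the request is enqueued and the task enters the wait state.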

While it is easy to see that dynamic allocation is 
most economical in the amount of time system resources are un- 
available, some time overhead must be paid each time a process 
requests a resource. The OS must check the state matrices to 
determine if safe states will result. This process can be 
lengthy for a system with many resources and many ready tasks. 

One must remember here that the overhead is really that time
used for dynamic allocation over and above that which would other-
wise be spent for allocation at job and step initialization times
as described above.

Figure 3: OS/360 Resource Allocation Algorithm

Unfortunately, no analytic studies or simulations of
these algorithms have been done to evaluate overhead costs. How-
ever, with careful thought given to the implementation of a
dynamic algorithm, its overhead can be held to a minimum. In
any case, the advantages of dynamic allocation would seem to
overshadow any time overhead that results.


2.5 Memory Management


Management of the use of memory is potentially the 
most critical activity of an operating system. It is very de- 
pendent on: 

a) The structure and characteristic behavior of the appli-
cation software. If the work load is well known and
dynamically predictable, especially with regard to its
memory requirements, allocation of space can be pre-
determined, by pre-planned overlays for example.

b) The system architecture. If sufficient operating memory
is provided to accommodate all programs at all times, dy-
namic allocation problems are eliminated. If, however,
a virtual memory design is adopted for its potential 
simplification of programming and its cost effectivity, 
the operating system becomes intimately involved in 
creating and allocating memory space, and its detailed 
design is further affected by the technique adopted for 
addressing the virtual memory system. 

c) Memory technology. The architecture of a virtual memory 
system and the functions of its operating system are 
significantly different for secondary storage with moving 
head disks, than for solid state block-oriented, random 
access devices such as the experimental magnetic bubble
domain memory.

Although memory management can assume a critical role in 
determining operating system size and efficiency, its problems 
cannot be addressed in detail in the absence of a memory hier- 
archy definition. The following review of methods of operating 
memory utilization is presented to underscore some of the factors 
involved in providing increasing levels of operating memory uti- 
lization by multiplexing. 


2.5.1 Operating Memory Multiplexing

The following examples describe practical applications 
of a number of techniques for increasing the utilization of op- 
erating memory. 


2.5.1.1 Non-multiplexed Memory:
In a non-multiplexed system [...]
of the program serves both to esta[...]
[...] names found in "subroutines" (which
[...] maintained units of program code), and
[...] and physical locations in memory.

[...] assembly, the mapping information is
[...] and is saved and accessible only as a
[...] computer simulator, for example. Most
[...] are of this design, usually because of
[...] requirements, typically 8K to 32K [...]

2.5.1.2 Partitioned Memory: A simple form of memory multi-
plexing is used when the physical memory is large enough to sup-
port the requirements of more than a single program at a time.
The OS/360 MFT and MVT systems implement fixed and variable par-
titions respectively. The normal objective of concurrently-
loaded programs is to provide more efficient use of the proces-
sor by increasing the chances that some program can use the
CPU when another is waiting for completion of I/O operations.

As in sequential execution, the mapping between names 
and locations is applied in all places at the time of loading, 
and the map is of no further use to the execution of the program. 

To further increase processor efficiency, a high-speed
secondary storage device may be used for "core-swapping". This
involves writing the contents of a partition out to the device
before its execution has been completed in order to make room
to bring in some other program ready to run. Because the name-
location mapping is not dynamically applied, the information
must be returned to its original location when its execution is
to be resumed.


2.5.1.3 Partitioned Memory with Relocation Registers: Under the
above mechanization, the application of the name-location map
takes place at one time, but over many spatial places. This has
the advantage of getting the mapping finished; however, it has
the disadvantage that the mapping is not readily reversed or
modified. Several systems (e.g., PDP-10, Univac 1108) use an
alternate scheme which re-applies the mapping each time. This
is achieved by providing one or more relocation registers, whose
function is transparent to the software, which supply offset 
values to be combined with logical or virtual addresses generated 
during the program's execution. A disadvantage of this approach 
is that it requires additional hardware to perform the combining 
as part of instruction execution. However, it has the valuable 
characteristic that the mapping remains available for modifica- 
tion, so that program and data sections may be relocated in the 
operating memory and only the relocation values need to be 
changed in the process. Thus, storage in use can be compacted 
to collect available space into one contiguous piece when neces- 
sary to find room to load an additional program. 

As in the partitioned memory scheme, "core-swapping" 
may be used for additional multiplexing. However, the use of 
the relocation registers makes it possible to return the infor- 
mation to any convenient location, rather than the precise place 
from which it was written. 
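
A minimal Python sketch of the relocation-register idea follows. The class name, the single base value per program, and the bounds check against a limit are assumptions made for illustration; the text specifies only that an offset is combined with each logical address and that relocation changes nothing but that offset.

```python
class RelocatedProgram:
    """Hypothetical program image addressed through one relocation register."""

    def __init__(self, base, size):
        self.base = base    # relocation register value
        self.size = size    # program length (bounds check is an added assumption)

    def physical(self, logical):
        if not 0 <= logical < self.size:
            raise IndexError("logical address outside program")
        return self.base + logical   # mapping re-applied on every reference


def compact(programs):
    """Slide programs down so that free space collects into one contiguous piece."""
    next_free = 0
    for program in sorted(programs, key=lambda p: p.base):
        program.base = next_free     # only the relocation value changes
        next_free += program.size
    return next_free                 # start of the single free area


job1, job2 = RelocatedProgram(100, 50), RelocatedProgram(400, 20)
print(job2.physical(10))
print(compact([job1, job2]), job2.physical(10))
```

The same logical address 10 refers to a different physical location before and after compaction, yet the program itself is unchanged, which is what makes it possible to return swapped-out information to any convenient location.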


2.5.1.4 Paging: An alternative to the use of relocation regis-
ters is to divide the program and data space, linearly arranged,
into a series of "pages" of a fixed size, ordinarily a power of
2 (e.g., XDS Sigma 7, CDC 3800). In address formation, a group
of bits from the logical address is used to select a page-
location word from an array called a page-table; this word con-
tains the memory-address of the page if it is currently there.
Otherwise, an indication of the absence of the page is provided,
along with the secondary-storage location at which the page may
be found. The physical storage space is thus divided into fixed-
size page frames, and the mapping between names and physical
location is dynamically applied. A strong advantage of this
approach is that logically contiguous space need not be physically
contiguous, nor need it even all be present. The relaxation of
[...] the pages for occupancy of storage space by implementing some
measurement of page reference behavior (with hardware help). Pages
appearing to be less needed may be overlaid with more lively ones.


Because all page frames are the same size, space mana- 
gement is simple, and requires only modest overhead at execution 
time. On the other hand, the page boundaries fall at arbitrary 
locations in code or data, rather than at logical divisions. The 
average usefulness of words in a page is therefore reduced, since 
a logical entity may occupy only a small part of a page, or cross 
a page boundary. 
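
The address formation described for paging can be sketched in Python as follows. The page size of 512 words, the tuple layout of a page-table entry, and the names translate and PageFault are assumptions chosen for illustration, not details of any of the machines mentioned.

```python
PAGE_BITS = 9                      # assumed page size of 2**9 = 512 words
OFFSET_MASK = (1 << PAGE_BITS) - 1


class PageFault(Exception):
    """Raised when the page is absent; carries its secondary-storage location."""


def translate(logical, page_table):
    """Map a logical address through a table of (present, location) entries."""
    page = logical >> PAGE_BITS            # high-order bits select the page-table word
    offset = logical & OFFSET_MASK         # low-order bits pass through unchanged
    present, location = page_table[page]
    if not present:
        raise PageFault(location)          # location is where the page may be found
    return (location << PAGE_BITS) | offset


page_table = [(True, 5), (False, 7340)]    # page 0 in frame 5, page 1 on secondary storage
print(translate(3, page_table))
try:
    translate(512 + 1, page_table)
except PageFault as fault:
    print("page absent, fetch from", fault.args[0])
```

Because every frame has the same size, bringing the absent page in requires only finding any free or replaceable frame and rewriting one page-table entry.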





2.5.1.5 Segmented Addressing: The simplest segmentation on a
logical (if not operational) basis is the scheme used in the
Burroughs B6700 and its predecessors. Each program block is
compiled into a virtual address space of its own, called a seg-
ment; locations may then be accessed by specifying a segment
number and an offset from the beginning of the segment. In
execution, the name-location mapping is applied dynamically.

Each segment has a segment descriptor which contains the physi- 
cal location of the beginning of the segment. However, this 
descriptor can also contain an indication that the segment is not 
in storage at the moment; in this case, the address in the des- 
criptor is the secondary storage location at which the segment 
may be found. 

An advantage of this type of segmentation is the direct 
relationship between the segment size and the logical unit of 
program or data it contains. This characteristic increases the 
average usefulness of words transferred in a segment load. 

A disadvantage of this scheme is that segments are 
small, segment descriptors are therefore numerous, and must 
consequently be located in operating memory rather than high- 
speed processor registers. The access to these necessarily 
slows down the address formation process; consequently some 
scheme of buffering in a small set of fast registers is usually 
utilized to shorten the access delay (see section 2.2.3) . A 
second disadvantage is that storage allocation occurs in vari- 
able sized units and is therefore more complex and consumes 
more processor time than for fixed-sized pages. 
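
Segment addressing through a descriptor can be sketched in the same manner. The Python below is an illustration under assumptions: the descriptor fields, in particular the length used for a bounds check, and the names SegmentDescriptor, SegmentAbsent, and segment_translate are hypothetical and are not taken from the B6700 definition.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    present: bool
    address: int   # main-memory base if present, else secondary-storage location
    length: int    # segment size in words (assumed field, used for bounds checking)


class SegmentAbsent(Exception):
    """Raised when the segment is not in storage; carries its secondary-storage location."""


def segment_translate(descriptors, segment, offset):
    """Form a physical address from a segment number and an offset."""
    descriptor = descriptors[segment]
    if not descriptor.present:
        raise SegmentAbsent(descriptor.address)
    if not 0 <= offset < descriptor.length:
        raise IndexError("offset outside segment")
    return descriptor.address + offset


descriptors = [SegmentDescriptor(True, 2000, 37), SegmentDescriptor(False, 9100, 400)]
print(segment_translate(descriptors, 0, 12))
try:
    segment_translate(descriptors, 1, 0)
except SegmentAbsent as absent:
    print("segment absent, fetch from", absent.args[0])
```

The segments here have unequal lengths (37 and 400 words), which is the source of the more complex variable-size storage allocation noted above.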


2.5.1.6 Segmentation Plus Paging: This method of addressing
and multiplexing was developed by the Multics group at MIT Pro-
ject MAC. It is implemented most ambitiously on the GE (Honey-
well) 645 designed for Multics, and also on the IBM 360/67.
In Multics, segments tend to be large, and each is divided into
fixed-size pages. Even page tables are paged, since they other-
wise would occupy too much operating memory. Paging is the
mechanism which accomplishes multiplexing; segmentation is uti-
lized for other purposes which are not relevant to this report.
However, it should be mentioned that segmentation is implemented 
in such a way that when two independent processes refer to the 
same segment, both processes utilize the same page table. Shar- 
ing is thereby implemented in a general and powerful way. 

The Intermetrics multiprocessor design [4] featured a 
segmented virtual memory system based in principle on the Bur- 
roughs designs. The policies for space allocation, segment 
placement, and replacement were, however, novel implementations 
of the operating system. The overall objective of the design 
was to reduce the usual overhead consumed by the memory manage-
ment function, by hardware assistance of address translation 
with associative memories local to the processors, and by spe- 
cially tailored OS routines to handle segment I/O. 

A more detailed examination of the characteristic 
differences between paging and segmentation, and the factors 
influencing virtual memory design is presented in Chapter 4. 


2.6 Implementational Aspects

This review, far from complete, of multiprocessor op- 
erating system design problems closes with some comments about 
the implementational aspects. The major objectives of anyone 
embarking on the design of an operating system should be: 

a) That the completed system work very closely to the way 
it was intended; 

b) That it not take forever to finish; 

c) That the resulting design be non-subtle, that it may
be easily understood, maintained, and if necessary
modified, other than by its creators.

2.6.1 System Specification 

A big step towards accomplishing the first objective 
is to establish clearly in the beginning what the operating 
system is expected to do, and how. A considerable fraction of 
the total programming effort should be devoted to identifying 
the functional requirements, and then thinking out an overall, 
coherent design that not only satisfies them, but possesses 
enough flexibility to accommodate later modification and addi-
tion. The end item is a detailed design specification which
deals with the structure to be implemented and its operating 
characteristics, and includes a description of how the com- 
pleted system is to be verified. 


2.6.2 Structure 

The second and third objectives are largely a matter 
of the way in which the software of the operating system is 
structured, and the techniques used to implement that struc- 
ture. 


Comprehensive operating systems have acquired a bad
reputation for complexity, cost and ultimate unreliability,
largely perhaps as the result of the widespread usage of the
IBM 360 series of computers. OS/360 was very ambitiously con-
ceived at a time when rigorous techniques of software construc-
tion (and the penalties of ignoring them!) were not as well
researched and understood as today. Problems with the use of
OS/360, and other designs, have prompted much study into the
theory and practice of operating systems, es-
pecially during the last five years or so. A gathering body
of knowledge on techniques of design and operation has become
available (see, for example [9]).

Dijkstra has pioneered the disciplined approach to
operating system design [10]. He organized the functions of a
multiprogrammed operating system into a number of sequential
processes. These were then hierarchically arranged to form
several independent levels of increasing abstraction of
machine operation. For example, the lowest hierarchical level
was that of the real machine itself. At the next to lowest level
were procedures for allocating processors to processes and
fielding interrupts from the real time clock. The level above
that managed the operation of the virtual memory, without concern
for processor availability. The next level fielded the inputs
from the operator keyboards, and so on. The application
programs formed the highest level. A programmer was thus able
to view the combination of hardware and software as a "virtual
machine", representing an abstraction of the real machine.
Needless to say, the whole concept precluded the use of machine
language coding by any application programmer, since this would
have cut straight through the screening levels of "virtual
machines". Each level of the system possessed a large degree of
independence of the other levels, and could be separately
conceived, implemented and tested.

Other operating system designs with different opera- 
tional requirements and system configurations would probably 
depart from the functional separations made by Dijkstra, but the 
basic philosophy may be adhered to. 


2.6.3 Systems Programming Language 

Just as a problem oriented higher level language assists 
in the structuring and implementation of applications software, 
the use of a language suited to the definition of OS functions 
has gained much support from operating system implementers.
The advantages can be viewed from both a managerial and a
technical aspect. The managerial benefits of HOL usage are too
well established to be repeated here. Various authors have
defined the features that would make a systems programming
language easy and efficient to use [11]. Almost all agree that
the language should possess a block structure and enforce name
scope rules. It should contain control features such as
procedures and functions, the statements IF THEN ELSE, DO FOR, and
DO CASE. Some language designs restrict data types to those 
generally agreed to be useful to systems programming, namely 
bit, character, pointer and various forms of arrays. Others, 
following the example of PASCAL [12] contain more powerful and 
flexible data structures, which allow the systems programmer 
freedom to adapt the language to his specific problem. The 
ability to address specific machine features is necessary, al- 
though the major portion of any operating system can be machine- 
independent. The need to generate efficient code is clear, if 
only to overcome the reluctance of non-believing systems pro- 
grammers to code in a higher level language! Almost all advo- 
cates insist on the absolute necessity of readability in the 
language, and the provision of comprehensive diagnostics by the 
compiler. From these characteristics, it is evident that sys- 
tems and application programming languages have quite similar 
objectives, and differ mainly in the natural incompatibility of the 
data types recognized. Several attempts have been made, therefore, 
to adapt existing HOLs for system programming, as the following 
examples illustrate. 

A subset of PL/I was chosen to code the operating system
for the comprehensive Multics system at MIT [13], which is based
on Honeywell 6000 computers. The Burroughs Corporation has
developed several versions of ALGOL 60 with differing degrees
of machine dependence [14] for different B6700 systems
programming applications, as a consequence of their long standing
use of ESPOL in the B5500. There is Extended ALGOL for the bulk
of systems programming, including the Extended ALGOL compiler
itself; Data Communication ALGOL, which allows the control
software for communications interfaces to be conveniently
programmed; and ESPOL, the original systems language, which
enables many of the B6700 features such as stacks, registers,
memory, the multiplexors, peripheral devices, etc., to be
addressed directly.

Several languages have been developed to handle systems 
programming for specific machine architectures. The University 
of Toronto is developing SUE for system programming on the IBM 
360 [15]. An extensible language, LSD, is being designed for
systems development on the IBM 360 at Brown University [16],
although it is not yet operational. PL360 [17], a language
designed by Wirth at Stanford University for the IBM 360, has
features that make it attractive for systems programming.
Carnegie-Mellon has developed and used BLISS for its DEC
PDP-10 [18].


It is strongly recommended that the operating system
for the SUMC be designed and written in one of these systems
programming languages, or at least in some tailored subset.
Most of the compilers have been written in the language itself,
which lessens the difficulty of transferring the compiler from
its original host machine to another.


Chapter 3

INTERRUPT STRUCTURE 


This chapter will discuss various aspects of the in- 
terrupt structure when applied to a multiprocessor. The first 
section will present a list of assumptions upon which the
following sections are based. The second section presents a brief
categorization of interrupts expected within the environment of
the space station multiprocessor. The third section discusses 
various problems that are encountered when attempting to develop 
an interrupt structure for the multiprocessor. 


3.1 Assumptions 

a) The basic assumption is that the concept of interrupts
is indeed required. It is possible to conceive of
computer systems that are well specified, in which all
equipments are synchronized and serviced in a
predetermined cyclic fashion. However, the system
contemplated for the space station is not well specified.
It will have to respond to conditions not anticipated in
the program flow. Therefore, the need for interrupts is
postulated.

b) A true multiprocessor is assumed. This includes a
"floating executive" and a configuration with three
or more processing units. With a floating executive,
any process can be executed on any processor. There
are no functions dedicated to any processor. The
exception is the I/OC, which does serve a specialized
function. Three or more processors are assumed so that
the generalized solution to multiprocessor interrupt
handling can be addressed.


3.2 Interrupt Categorization

An interrupt can be defined as any condition which 
causes an involuntary interruption in the sequence of execu- 
tion of a process. The interrupt is not explicitly anticipated 
in a program's code. It can be considered to be an involuntary 
procedure call to the interrupt servicing routines, with an
ultimate return link to the original process.




INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840 


Interrupts may be categorized into three distinct classes:


3.2.1 Process Oriented 

Process oriented interrupts are those associated with 
the process in execution. There are a number of distinct types. 
Arithmetic and control traps are caused whenever an unacceptable
condition prevents further execution. An interrupt from a
"watchdog timer" indicates that a process has been running for 
an excessive time. 


The above two types of process-oriented interrupt are
synchronous with the process and occur while the process is
running. There exists a further class of process-oriented
interrupts which can occur when a process is in a waiting state.
These interrupts, sometimes called software interrupts, result
from HOL statements of the following form, as discussed in
section 2.3.2:

ON (event) <code block> 

This statement establishes a linkage which causes the
specified <code block> to be executed when the specified (event)
is signalled. If the process is running when the (event)
is signalled, then it is interrupted to execute the <code block>.
If the process is not running when the (event) is signalled,
then as soon as the process which issued the ON statement enters
the running state it will be interrupted to execute the
<code block>.
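
The two cases described for the ON statement (process running, process
not running) can be illustrated with a minimal Python sketch. The class
and method names below are hypothetical and are not taken from this
Report; the sketch only models the deferral of the code block until the
issuing process next enters the running state.

```python
class Process:
    """Toy process that can register ON (event) code blocks."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.handlers = {}   # event name -> code block (a callable)
        self.pending = []    # events signalled while not running
        self.log = []        # results of executed code blocks

    def on(self, event, code_block):
        """Model of the HOL statement: ON (event) code block."""
        self.handlers[event] = code_block

    def signal(self, event):
        """Called when the event is signalled somewhere in the system."""
        if event not in self.handlers:
            return
        if self.running:
            self._interrupt(event)       # interrupted immediately
        else:
            self.pending.append(event)   # deferred until it next runs

    def enter_running_state(self):
        """Scheduler hook: deliver deferred events before resuming."""
        self.running = True
        while self.pending:
            self._interrupt(self.pending.pop(0))

    def leave_running_state(self):
        self.running = False

    def _interrupt(self, event):
        self.log.append(self.handlers[event](event))


p = Process("nav")
p.on("TELEMETRY_READY", lambda ev: "handled " + ev)
p.signal("TELEMETRY_READY")   # process is waiting: the event is queued
p.enter_running_state()       # code block runs once the process runs
```

In this sketch the pending list plays the role of the linkage set up by
the ON statement; a real executive would hold it in the process state
record.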

3.2.2 System Oriented 

This category of interrupt does not have any particu- 
lar affinity for the currently running process. Conditions 
such as I/O Complete, I/O Error, and Absent Segment Trap fall 
into this category. Both I/O Management and Memory Management
are executive functions.

Many failures or error conditions, such as power fail- 
ure, can be considered system oriented. 


3.2.3 Processor Oriented 

Even with a floating executive and no dedicated func- 
tions to particular processors, it does become necessary to 
direct an interrupt to a specific processor, independent of 
the process being executed. For example, in response to an 
error signal one processor might direct another processor to 
terminate or restart. The entire area of system initialization
and reconfiguration requires direct communication with specific
processors. The processor-directed interrupt is a convenient
mechanism for meeting this requirement.


3.3 Multiprocessor Interrupt Problem Areas

A number of problems involved in the servicing of in- 
terrupts exist. Some are aggravated in a multiprocessor en- 
vironment and some are unique to the multiprocessor environ- 
ment. Four major areas are discussed below. 


3.3.1 Which Processor to Interrupt? 

In a multiprocessor system, a question arises as to 
which processor to select to handle a given interrupt. For 
process oriented interrupts which occur while the process is 
running, the decision is trivial. The interrupt should be 
steered to the related processor. Similarly, so should
processor-related interrupts.

The remainder of the interrupts are system-oriented
or non-running-process-oriented, and have no affinity for any
particular processor.

A number of options are possible in assigning a pro- 
cessor to service the interrupt: 

a) An arbitrary processor may be interrupted based upon 
some random selection algorithm. The interrupted pro- 
cessor may then execute a software routine which de- 
termines whether the interrupt condition is of higher 
priority than the process which was interrupted. If 
it is not, then the interrupted process will be sche- 
duled like any other process according to its priority. 

b) All the processors may be interrupted. The interrupt
service routine can be made a "critical section" of
code which can only be executed by one processor at
a time. The first processor to access this code
services the interrupt. The other processors revert to
their original processes.

c) A sequential selection employing a "round robin" style
algorithm may be used. In this way, the interrupts
are loaded equally upon all processors. This option
of course does not consider the process which is
running on the processor at the instant of interruption.

d) An assigned processor might service all interrupts, or
specific interrupt conditions can be preassigned to
specific processors.

e) The processor executing the lowest priority process 
will be selected to service the interrupt. If the 
interrupt priority is lower than any running process, 
then the required interrupt response will not be exe- 
cuted until a process swap results in a lower priority 
process. 

The approach recommended in this Report is to provide
a combination of c) and d) by placing within the I/O control
an element of hardware which automatically determines the most
interruptible processor (based upon the priority of the process
running), and receives and distributes all potential interrupts.

Running-process-oriented interrupts can bypass the
interrupt logic within the I/OC since the processor to be
selected is known a priori.
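
The selection of the "most interruptible" processor is intended to be
performed by hardware within the I/OC. Purely as an illustration of the
decision, a software sketch is given below. The function name, the
convention that a larger number means a higher priority, and the
treatment of equal priorities are assumptions of the sketch.

```python
def select_processor(running_priorities, interrupt_priority):
    """Return the index of the processor to interrupt, or None.

    running_priorities[i] is the priority of the process currently
    running on processor i (larger number = higher priority; this
    convention is an assumption of the sketch).
    """
    # The most interruptible processor runs the lowest-priority process.
    candidate = min(range(len(running_priorities)),
                    key=lambda i: running_priorities[i])
    if interrupt_priority > running_priorities[candidate]:
        return candidate
    # Otherwise the interrupt is held until a process swap leaves a
    # lower-priority process running (equal priority is also held here).
    return None


chosen = select_processor([5, 2, 7], interrupt_priority=4)
held = select_processor([5, 2, 7], interrupt_priority=1)
```

With three processors running processes of priority 5, 2 and 7, an
interrupt of priority 4 is steered to the second processor, while an
interrupt of priority 1 is held.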


3.3.2 Response Time 

There is a small class of interrupts which require al- 
most immediate response. These are system oriented and deal 
with equipment failures or other emergency situations. One ex- 
ample is a "power failure" interrupt. This must be responded 
to within microseconds in order to move any volatile registers 
into permanent storage and then systematically to shut down the 
system. 


The class of conditions associated with arithmetic
and control traps does not require instantaneous response but
the running process cannot continue until after the trap is
serviced. Any trap condition falls into this category, even
system oriented traps such as the Absent Segment Trap.

Quite often specifications are generated and systems
built which require I/O Complete interrupts to be generated
within microseconds of an I/O completion. From a performance
point of view, most I/O interrupts can tolerate a response time
of the order of milliseconds. For example, if M3 requires an
average of 10 milliseconds for each access, it is clearly
unnecessary for its completion to be signalled within
microseconds.


3.3.3 Innovations 

A number of innovations may be suggested in the I/O 
interrupt area. These suggestions exploit the space station 
type of I/O, namely mass storage M3, and a data bus.

a) "Quiet" I/O

When the multiprocessor system workload is heavy,
the frequency of Absent Segment Traps can be expected
to be relatively high. Conventional processing of an
Absent Segment Trap requires entry to an interrupt
handler, initiation of an M3 operation, and placement
of the process into the wait state. Upon completion
of the M3 operation, an I/O Completion interrupt is
signalled. The handler for this interrupt is then
entered, the process waiting for the segment is readied,
another I/O operation to M3 is initiated if one is
queued, and the processor allocation routine is called
to see if it is appropriate to assign a processor to
the newly readied process.

An alternate implementation is suggested to avoid,
at least in most cases, the necessity for entering the
I/O Completion interrupt handler when the segment
transfer is concluded. This is achieved by providing a
capability in the I/O controller which causes it to
make a choice of whether or not to signal I/O
completion. Thus a dynamic decision is made as to whether
the interrupt should be suppressed or signalled,
depending upon the existence of a queue of operations
waiting for the device. If the interrupt is suppressed
the condition is made known to the system by the
setting of a bit field in a location accessible to the
absent segment trap handler. After initiating the
operation to make an absent segment present, this
handler checks the completion states of M3 segment transfers
previously issued. The processes whose segments are
found to have completed their transfers are readied;
thus the utilization of the I/O Completion interrupt
handler is avoided. This diminishes the overhead for
absent segment handling, especially under heavy load,
when computational overhead is most detrimental to the
system throughput.

b) Data Bus Control

If a command response data bus, with a minor cycle
of 20 milliseconds, is employed then it is clearly
unnecessary to interrupt the system after each peripheral
device is accessed. In principle, the synchronous
nature of the data bus does not require interrupts for
normal processing. However, one may consider the need
for interrupts due to infrequent events:

1) An interrupt might be generated by the Data Bus
Control Unit if certain types of failures are
detected.

2) For equipments which are interrogated at a very
low frequency or even randomly an interrupt might
be considered at the end of the request.

Both of these suggestions impose little if any load on
the system due to their very low frequency of operation.
A checkout problem may, however, arise in trying to
verify successful operation for infrequent interrupts
at any point of execution in a program.
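
The "Quiet" I/O scheme of item a) can be sketched as follows. This is
only an illustration under stated assumptions: the names are
hypothetical, the completion bit field is modelled as a Python set, and
the controller's rule for deciding when to suppress the interrupt is not
modelled (the decision is simply passed in as a flag).

```python
class QuietIOController:
    """Toy M3 controller that may suppress the I/O Completion interrupt."""

    def __init__(self):
        self.completed = set()   # stands in for the completion bit field
        self.interrupts = []     # completions signalled as interrupts

    def transfer_done(self, process, suppress):
        # 'suppress' is the controller's dynamic decision; the rule that
        # produces it (based on the device queue) is not modelled here.
        if suppress:
            self.completed.add(process)
        else:
            self.interrupts.append(process)


def absent_segment_trap(controller, faulting_process, waiting, ready):
    """Trap handler: start the new transfer, then poll earlier ones."""
    waiting.add(faulting_process)        # M3 operation initiated (not modelled)
    for proc in sorted(controller.completed):
        if proc in waiting:
            waiting.discard(proc)
            ready.append(proc)           # readied without a completion interrupt
    controller.completed.clear()


ctrl = QuietIOController()
waiting, ready = {"A"}, []
ctrl.transfer_done("A", suppress=True)          # no interrupt is raised
absent_segment_trap(ctrl, "B", waiting, ready)  # trap for B also readies A
```

The point of the sketch is that process A is readied inside the trap
handler that was entered anyway on behalf of process B, so the I/O
Completion interrupt handler is never entered for A.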


3.3.4 The Interrupt Sequence 

When an interrupt is signalled to a processor, the de- 
tails of its local environment, the processor's status, must be 
saved so an eventual return is possible. In a stack oriented
machine an interrupt response can be executed parasitically on
top of the process' stack, with entrance and return functions
performed automatically.

Since procedures may be nested to multiple depth, so
can interrupts. The only limit is the number of display
registers provided to mark the beginning of each lexical level in
the stack.


3.3.5 Interrupt Functional Response 

System or processor-oriented interrupts possess a sta- 
tic (pre-determined) response. Once the response is established 
it is not changed. However, for process-oriented interrupts 
(traps) one may conceive of situations where each process may 
desire a different response to particular interrupts. For ex- 
ample, one process might want to respond to a square root of a 
negative number trap by substituting a zero for the answer. 
Another process might deal with complex numbers and cause a 
re-entrance into the square root instruction with a change of 
sign of the argument. 

For all trap conditions the system must provide a de- 
fault option. It is suggested that a process be allowed to 
override this system option by providing its own response to 
particular traps. Any process at any lexical level should be
allowed to specify, if necessary, its own response to
process-oriented traps.
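
The suggested override of the system default can be illustrated by a
small sketch using the square root example above. The class, the trap
name "SQRT_NEG" and the choice of default response (aborting the
operation) are assumptions made for illustration only.

```python
import math


def default_sqrt_trap(x):
    # Hypothetical system default response: abort the operation.
    raise ValueError("square root of a negative number")


class TrapTable:
    """System default responses plus per-lexical-level overrides."""

    def __init__(self):
        self.system_default = {"SQRT_NEG": default_sqrt_trap}
        self.overrides = []          # one dict per active lexical level

    def enter_level(self):
        self.overrides.append({})

    def exit_level(self):
        self.overrides.pop()

    def on_trap(self, name, response):
        # Requires enter_level() to have been called first.
        self.overrides[-1][name] = response

    def respond(self, name, arg):
        # The innermost lexical level that specified a response wins.
        for level in reversed(self.overrides):
            if name in level:
                return level[name](arg)
        return self.system_default[name](arg)


def checked_sqrt(table, x):
    if x < 0:
        return table.respond("SQRT_NEG", x)
    return math.sqrt(x)


table = TrapTable()
table.enter_level()
table.on_trap("SQRT_NEG", lambda x: 0.0)             # substitute zero
outer = checked_sqrt(table, -4.0)
table.enter_level()
table.on_trap("SQRT_NEG", lambda x: math.sqrt(-x))   # re-enter with sign changed
inner = checked_sqrt(table, -4.0)
table.exit_level()
```

Leaving the inner lexical level restores the outer response, and
leaving all levels restores the system default.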



Chapter 4 


MEMORY HIERARCHY 


Memory is possibly the most difficult of any computer
element to specify, implement and use. It is in this area that
technological limits and cost factors are first encountered
in the design of an advanced high performance computer
system. The inability of a single, currently known, memory
technology to meet the conflicting requirements of high
access speed and high storage capacity has led to the
hierarchical concept of levels of memory.


4.1 Basic Hierarchy Description 

Within the multiprocessor structure, one finds a number
of levels of memory used for varying purposes.


4.1.1 M0 - Micro Level Control Memory

From one point of view, micro memory is only a particular
implementation of a control unit and therefore should not
be considered part of the memory hierarchy. Alternatively,
another point of view suggests that micro memory should be
used for execution of the frequently used operating system
primitives and subroutines. It is from this second point of
view that M0 is considered an element of the memory hierarchy.


4.1.2 M1 - Local Memory

M1 storage is dedicated to the processing unit. Its
function can range from a register set, as is found in the SUMC,
to a complete cache memory as used in the IBM 360/85. The
major function of M1 is to increase the performance of the
system. Its speed is in the 100 nanosecond access time range and
its size can range from 16 words (for register storage) to 4K
words (for a cache implementation).


4.1.3 M2 - Operating Memory 

In a multiprocessor environment, M2 is that part of
memory which is shared by the processing units and I/O controllers.
M2 must of necessity consist of a number of separate memory
modules so that simultaneous access of different modules may be
made by the processing units and I/OC. M2 cycle time is in the
1 microsecond range and its size is of the order of 100K words.


4.1.4 M3 - Mass Memory 

M3, historically a drum or disk, provides the function 
of augmenting the M2 storage. It is used to hold all the pro- 
grams and data segments not currently being used in the proces- 
sing function. M3 is used to implement the concept of a larger 
M2 virtual memory. It is characterized by an access time in the 
millisecond range and a storage size consisting of millions of
words.


4.1.5 M4 - Archival Storage 

Archival storage (possibly implemented with a magnetic 
tape unit) is included for completeness. It is used as the re- 
pository of files and other information which does not undergo 
rapid change or frequent use. Conventionally, M4 is considered 
to be an I/O device and is controlled accordingly. 

The remainder of this chapter will concentrate on the 
relationships between the major elements of the memory
hierarchy which contribute to system performance, namely M1, M2,
and M3.


4.2 Local Storage

4.2.1 The Problem - Memory Contention vs. Performance 

One of the major reasons for using a multiprocessor is
to increase the overall performance or work delivered by the
system. If the extra performance were not required a
uniprocessor would be employed. Ideally, a system with R
processing units should produce R times the work of a single
processor system. One factor which tends to reduce the overall
performance of the multiprocessor is M2 memory contention. The
effect is to degrade the M2 cycle time ($t_2$), yielding an
effectively slower cycle time ($t_{2eff}$).

One way to reduce memory contention is to provide a
limited amount of dedicated memory local to each processor
(M1). If M1 possesses a cycle time ($t_1$) which is substantially
faster than $t_2$ then a performance increase can be obtained.

4.2.1.1 Performance Model: Postulate the multiprocessor model
shown in Figure 4.1, and make the following definitions and
assumptions:

a) $n_1$ = number of M1 cycles per unit time for a single
processor

b) $t_1$ = M1 cycle time

c) $n_2$ = number of M2 cycles per unit time executed by a
single processor

d) $t_2$ = M2 cycle time

e) $t_{2eff}$ = effective (slower) M2 cycle time due to memory
contention

f) W = work per unit time from a single processing unit.
This is defined as proportional to the total number
of M1 and M2 cycles per unit time. Usually
processor work is defined in terms of the number
of instructions per second. For a conventional
360 type architecture an instruction usually
corresponds to two M2 cycles. In a sense the internal
processor cycles should also be considered useful
work. Indexing which does not require an M2 access
because it might use an internal register is a
very useful function. If a multiprocessor makes
very large use of its internal M1 storage these
cycles are just as important as M2 cycles in
estimating overall work.

$$W = n_1 + n_2$$

g) R = number of processing units

h) M = number of independent M2 modules

i) h = fraction of all memory requests that use M1
(the hit ratio). This is for a single processing unit.

$$h = \frac{n_1}{n_1 + n_2}$$

j) It is assumed that a processing unit is always making
an M1 or an M2 reference and that these references are
mutually exclusive, that is they cannot occur
simultaneously. Let $n_1 t_1 + n_2 t_{2eff} = 1$ unit of time.

[Figure 4.1 diagram: R processing units connected through an
internal bus to M memory modules]

Notes:

1) A processing unit contains a P-M1 combination

2) The internal bus allows all the R processing units
to communicate with all M memory modules

3) There is no internal bus contention

Figure 4.1: Multiprocessor Model

From the above definitions it follows that

$$W = \frac{1}{t_{2eff}} \left[ \frac{r}{h + (1 - h)\,r} \right]$$

where

$$r = \frac{t_{2eff}}{t_1}$$

The term in brackets can be considered to be an
enhancement factor by which performance is increased.

Figure 4.2 plots this factor as a function of h.
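
As a numerical illustration of the single processor model, the
enhancement factor and W can be evaluated directly. The function names
and the sample cycle times below are assumptions chosen for the
example; only the formulas come from the model above.

```python
def enhancement_factor(h, r):
    """Bracketed term r / (h + (1 - h) * r), with r = t2eff / t1."""
    return r / (h + (1.0 - h) * r)


def work_per_unit_time(h, t1, t2eff):
    """W = (1 / t2eff) * enhancement factor."""
    return enhancement_factor(h, t2eff / t1) / t2eff


# Illustrative values: t1 = 0.1 and t2eff = 1.0 time units, so r = 10.
no_local_store = enhancement_factor(0.0, 10.0)
half_hits = enhancement_factor(0.5, 10.0)
all_hits = enhancement_factor(1.0, 10.0)
```

With no hits the factor is 1 (no enhancement), and with every
reference satisfied by M1 it equals r, which is why a high
$t_{2eff}/t_1$ ratio makes local storage attractive.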

We see from this simplified model that the introduction
of M1 with a reasonably high hit ratio can potentially increase
the performance of a processing unit, especially if the $t_2/t_1$
ratio is high. Many overhead factors involved in the utilization
and control of M1 will tend to lessen the improvement.

The effect of memory contention upon $t_{2eff}$ will now be
calculated. Assume that requests to M2 are independent and
randomly distributed across the address space. In reality this
assumption can be seriously questioned since program and data
both possess locality. That is, there is a strong correlation
between successive M2 access events. This is extremely difficult
to measure since the programming load is not known. For
lack of a better model, the random distribution is assumed.

A processor will request access to M2 with a probability

$$A = \frac{n_2\, t_{2eff}}{n_1 t_1 + n_2\, t_{2eff}}$$

It can be shown that

$$A = \frac{r(1 - h)}{r(1 - h) + h}$$

The probability of accessing any particular M2 module is
therefore A/M. Given that a processor is requesting access to a
particular M2 module, the probability that none of the other
R - 1 processors is requesting access to that module is:

$$P(0) = (1 - A/M)^{R-1}$$

The probability that 1 out of the R - 1 other processors is
requesting access to the particular M2 module is:

$$P(1) = \binom{R-1}{1} (1 - A/M)^{R-2} (A/M)^{1} \qquad (*)$$

In general, the probability that i processors out of the R - 1
other processors desire access to the same module is:

$$P(i) = \binom{R-1}{i} (1 - A/M)^{R-1-i} (A/M)^{i}$$

If there is no contention, the M2 access time is $t_2$. If one
other processor is requesting, the access time could reach $2t_2$.
In general, with i other processors the access time could reach
$(1 + i)\,t_2$.
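
The contention probabilities P(i) can be tabulated numerically. The
sketch below is illustrative only; the function name and the sample
values of R, M and A are assumptions. It also checks two standard
properties of the binomial distribution: the probabilities sum to one
and the mean number of contenders is (R - 1)(A/M).

```python
from math import comb


def p_contenders(i, R, M, A):
    """P(i): probability that i of the other R - 1 processors are
    requesting the same M2 module, each independently with
    probability A / M."""
    q = A / M
    return comb(R - 1, i) * (1 - q) ** (R - 1 - i) * q ** i


# Illustrative values: 4 processors, 8 modules, A = 0.6.
R, M, A = 4, 8, 0.6
probs = [p_contenders(i, R, M, A) for i in range(R)]
total = sum(probs)
mean_contenders = sum(i * p for i, p in enumerate(probs))
```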

The effective access time averaged over all contention
possibilities is therefore

$$t_{2eff} = \sum_{i=0}^{R-1} (1 + i)\, t_2\, P(i) = t_2 \sum_{i=0}^{R-1} P(i) + t_2 \sum_{i=0}^{R-1} i\, P(i)$$

Since P(i) is a binomial distribution [1],

$$\sum_{i=0}^{R-1} P(i) = 1$$

and

$$\sum_{i=0}^{R-1} i\, P(i) = (R-1)(A/M)$$

therefore

$$t_{2eff} = t_2 \left[ 1 + \frac{(R-1)\,A}{M} \right] = t_2 \left[ 1 + \frac{(R-1)\, r(1-h)}{M\,[r(1-h) + h]} \right]$$

* $\binom{R-1}{i} = \frac{(R-1)!}{i!\,(R-1-i)!}$

Some insight may be gained by studying the overall total system
work ($W_T$) where:

$$W_T = R\,W$$

$$W_T = \frac{1}{t_2} \left[ \frac{R\,r}{[r(1-h) + h] + \dfrac{(R-1)\, r(1-h)}{M}} \right]$$

The following figures (Figure 4.3 and Figure 4.4)
depict $W_T$ for h = 0 and h = .5. The following two facts should
be observed:

a) System performance is increased as more M2 modules are
added.

b) Local storage can significantly increase performance.
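
The two observations can be checked numerically. Because r itself
depends on $t_{2eff}$, the sketch below solves for $t_{2eff}$ by simple
repeated substitution starting from $t_2$; that solution method, the
function names and the sample parameter values are assumptions of this
sketch and are not taken from the Report.

```python
def effective_m2_cycle(t1, t2, h, R, M, iterations=50):
    """Solve t2eff = t2 * (1 + (R - 1) * A / M) by repeated substitution,
    where A = r(1 - h) / (r(1 - h) + h) and r = t2eff / t1."""
    t2eff = t2                       # start from the no-contention value
    for _ in range(iterations):
        r = t2eff / t1
        A = r * (1 - h) / (r * (1 - h) + h)
        t2eff = t2 * (1 + (R - 1) * A / M)
    return t2eff


def total_work(t1, t2, h, R, M):
    """W_T = R * W, with W = 1 / (h * t1 + (1 - h) * t2eff)."""
    t2eff = effective_m2_cycle(t1, t2, h, R, M)
    return R / (h * t1 + (1 - h) * t2eff)


# Illustrative values: t1 = 0.1, t2 = 1.0 time units, R = 4 processors.
four_modules = total_work(0.1, 1.0, 0.0, 4, 4)
eight_modules = total_work(0.1, 1.0, 0.0, 4, 8)
with_local_store = total_work(0.1, 1.0, 0.5, 4, 4)
```

For h = 0 every reference goes to M2, so A = 1 and the effective cycle
time is exactly $t_2[1 + (R-1)/M]$; doubling the number of modules or
raising the hit ratio to .5 both increase $W_T$ in this example.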


4.2.2 Two Approaches to an Implementation

A major design question naturally arises. How does 
one use local storage to obtain a hit ratio of .5 or .9 or 
more? The answer is complex and involves studying the nature 
of program execution in relationship to the instruction set. 
Two approaches will be mentioned. 


4.2.2.1 The Cache Concept: As CPU speeds have increased with
advances in technology, computers have been able to handle
larger and more complex processing tasks, and the demand for
operating memory capacity has increased. Since capacity and
speed are conflicting factors in memory design, a hierarchical
memory organization was proposed many years ago [2] to enable
these two desirable qualities to be independently developed.
Advances in semiconductor technology have only recently made
this concept feasible.

A backing store, M2, which de-emphasizes speed to
achieve an adequate capacity, interfaces to a buffer store or
cache, M1, whose primary design objective is speed.



Figure 4.5: Buffer Store Organization

The concept depends for its success on the notion of
locality. Locality is an experimentally observed fact of
program behavior by which references tend to occur within a
region of the program's address space, and this region migrates
relatively slowly. Locality is a natural outcome of the way
people think and write programs: concentrating on one task
at a time, using loops, using sequential control, etc. [3].
The degree of locality is influenced by programming style,
data organization, strategy of algorithm, and the programming
language. Locality gives rise to the notion of the program
working set, which is the minimal set of blocks that a program
requires to have in the cache in order to run efficiently.
If less than the working set is in M1, the probability of
occurrence of a reference to a missing block, m, increases. This
situation is most likely to occur in a multiprogrammed
environment when the number of programs n exceeds the capacity of
the cache to contain all their working sets, as illustrated below.


Figure 4.6: Probability of Missing Block (m) Versus
Number of Working Sets


It is an experimentally verified fact that a process
favors references to a small set of its total address space,
and that provided this set is contained within M1, the need
to access program areas not in M1 arises relatively
infrequently. When access to M2 becomes necessary, more information
than is immediately required is transferred to M1, in the
expectation that references in the vicinity of the accessed word
are likely. The relationship between the size of M1, the amount
of information transferred, and the effect of different program
addressing behaviors was studied by Gibson [3]. He concluded
that an M1 capacity of 2K to 4K words and a transfer block
size between 4 and 16 words provided best results. He also
found that the dynamics of buffer operation were more sensitive
to the addressing patterns of the various programs than to any
other factor.
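
The sensitivity of the buffer to addressing patterns can be
illustrated with a toy simulation. The sketch below is not Gibson's
model: it assumes a fully associative buffer with least recently used
(LRU) replacement, which is a choice made here for simplicity, and the
function name and address traces are hypothetical. It assumes the
buffer holds at least one block and the trace is not empty.

```python
from collections import OrderedDict


def hit_ratio(addresses, cache_words, block_words):
    """Fraction of references found in an LRU buffer of whole blocks."""
    capacity = cache_words // block_words     # number of block frames
    frames = OrderedDict()                    # block number -> None, LRU first
    hits = 0
    for addr in addresses:
        block = addr // block_words
        if block in frames:
            hits += 1
            frames.move_to_end(block)         # mark as most recently used
        else:
            if len(frames) >= capacity:
                frames.popitem(last=False)    # evict least recently used
            frames[block] = None              # whole block brought in from M2
    return hits / len(addresses)


# A 32-word loop executed 10 times fits in a 64-word buffer.
tight_loop = [a for _ in range(10) for a in range(32)]
# Two sweeps over 128 words exceed the 64-word buffer.
long_sweep = [a for _ in range(2) for a in range(128)]

loop_ratio = hit_ratio(tight_loop, cache_words=64, block_words=8)
sweep_ratio = hit_ratio(long_sweep, cache_words=64, block_words=8)
```

The loop misses only on the first touch of each of its four blocks,
whereas the sweep misses on every block in both passes; the only hits
it obtains come from the neighbouring words transferred with each
block.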

To maintain a given processor's speed, data transfer
from M2 must occur at an adequate rate. M2's slower access
can be compensated for by increasing the transfer path width.
This can be achieved by:

a) An M2 technology which yields a long physically stored
word, e.g., the pseudo-2 1/2D organized plated wire
memory [4] which allows several hundred bits to be
accessed at once.

b) Organizing M2 into a number of smaller modules and
interleaving the addresses, so that contiguous addresses
1, 2, 3, are stored at corresponding locations in
modules 1, 2, 3, rather than in consecutive locations in
any one module. This has been the approach employed
by current designs such as the IBM 360/85, 91 and 195,
which use core technology for M2.
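
The address interleaving of item b) amounts to a simple mapping from a
word address to a module number and a location within that module. The
sketch below uses zero-based numbering and a hypothetical function
name; it is an illustration of the principle rather than the
addressing logic of any of the machines named.

```python
def interleaved_location(address, num_modules):
    """Map a word address to (module number, location within module)."""
    return address % num_modules, address // num_modules


# With 4 modules, consecutive addresses fall in consecutive modules.
mapping = [interleaved_location(a, 4) for a in range(6)]

# A 4-word block therefore touches 4 different modules and can be
# fetched with the modules operating in parallel.
modules_for_block = {interleaved_location(a, 4)[0] for a in range(8, 12)}
```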

The high speed of M1 is now generally realized by
bipolar semiconductor techniques rather than thin-film. Buffer
memories of up to 1/4 million bits with cycle times less than
200 ns have been built, although similar speeds at far lower
power dissipations are being achieved by current plated wire
designs [5].

The above discussion has been in terms of a processor
"read" operation. Writing into the buffer presents an additional
problem in that the contents of the buffer do not represent the
primary source of the program being processed. A processor
"write" must be reflected in an update of the primary source,
which is stored in M2. This can be achieved in two ways:

a) Storing through: Every "write" request causes an
immediate update of M2 as well as the cache.

b) Block update: Write requests are allowed to accumulate
in M1. Whenever a block is to be replaced by the block
replacement logic the modified block is written out to
M2.




INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840 



Which of the two techniques is chosen depends on program
behavior: writes tend to cluster in time and in program space,
so that for small blocks of 4 to 16 words a block update
technique may result in lower average M1 to M2 traffic density
and transfer delay.
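
The effect of write clustering on the two policies can be sketched by counting M2 updates for a trace of write addresses. This is only an illustration: the least-recently-used replacement rule, the function names, and the trace are assumptions, since the report does not specify the block replacement logic. Store-through is counted in word writes and block update in block transfers, so the two figures are not in the same units.

```python
def m2_writes_store_through(write_addresses):
    """Store-through: every processor write is passed on to M2 immediately."""
    return len(write_addresses)


def m2_block_writes_block_update(write_addresses, block_size, num_blocks):
    """Block update: count blocks written back to M2.

    The buffer holds num_blocks blocks and replaces the least recently
    used one (an assumed policy). Only writes are traced, so every
    resident block has been modified.
    """
    resident = []          # block numbers, least recently used first
    write_backs = 0
    for address in write_addresses:
        block = address // block_size
        if block in resident:
            resident.remove(block)
        elif len(resident) == num_blocks:
            resident.pop(0)        # replace the least recently used block
            write_backs += 1       # its modified contents go out to M2
        resident.append(block)
    return write_backs + len(resident)   # flush what is still in the buffer


# Twelve clustered writes touching only two 4-word blocks.
TRACE = [0, 1, 2, 3, 0, 1, 2, 3, 16, 17, 16, 17]
```

For this trace store-through issues 12 word writes to M2, while block update with two resident blocks issues 2 block transfers.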


There are a number of arguments which can be raised
against the use of a cache in a multiprocessor system.

a) Cost. To be effective a 4K word cache of high speed
(100 ns) monolithic memory must be employed in each
processing unit.

b) To keep the cache filled with useful data a large
bandwidth of data from M2 must exist (128-256 bits per
access). Many of the words accessed from M2 might not
be used. This unnecessary M2 traffic tends to increase
M2 contention and thus reduce performance.

c) In a multiprocessor system the use of a cache with
COMPOOL data presents a problem in keeping the copies
present in the various caches updated. (See section 2.3.1)

d) IBM's successful use of the cache is based partially
upon the inefficiency of the 360 instruction set.
That is, quite often small program loops are inserted
by a compiler to execute primitive functions which
could have been basic instructions in other systems.


4.2.2.2 M1 in a Stack-oriented, Descriptor-based System: The
problem faced in employing M1 is to use it for information which
has a high probability of being accessed many times (a high hit
ratio, h). Traditionally, base registers and index registers
have been allocated to the local storage of a processor for
reasons of speed and their high frequency of use. However,
register management problems tend to increase overhead.

Intermetrics proposes to use M1 for specialized storage
and to have the management of M1 an automatic hardware function.


In a stack-oriented machine it was realized that the
top few entries of the stack provide the most referenced
elements. For this reason the first 8 stack locations are made
resident in M1. M1 stack overflow pushes the bottom of the
M1 portion of the stack into M2.
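
The split of the stack between M1 and M2 can be modelled in a few lines. The sketch below is a software illustration of what the report proposes as an automatic hardware function; the class name and the refill-on-pop rule are assumptions, as the text describes only the overflow direction.

```python
from collections import deque

M1_STACK_SLOTS = 8   # the text keeps the first 8 stack locations in M1


class SplitStack:
    """A stack whose top entries live in M1 and whose remainder lives in M2."""

    def __init__(self):
        self.m1 = deque()   # top-of-stack window, newest entry on the right
        self.m2 = []        # spilled entries, oldest first

    def push(self, value):
        if len(self.m1) == M1_STACK_SLOTS:
            # M1 overflow: the bottom of the M1 portion moves to M2.
            self.m2.append(self.m1.popleft())
        self.m1.append(value)

    def pop(self):
        value = self.m1.pop()
        if self.m2:
            # Assumed refill policy: pull one entry back so M1 stays full.
            self.m1.appendleft(self.m2.pop())
        return value
```

Pushing ten values leaves the two oldest in M2 and the eight newest in M1; operators that work on the top of the stack then never touch M2.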

The descriptor is the most referenced data type. For
this reason the 32 most recently referenced descriptors are
retained in M1. An associative mapping mechanism is employed for
control of this descriptor cache.

The dynamic nature of the stack creates a situation
where the starting location of each lexical level must be
quickly accessed. For this reason a set of from 16 to 32 base
registers is proposed. Each base register contains a pointer
to the start of a lexical level and is automatically accessed
when addressing within the stack is desired.

An instruction set which is organized around this
machine tends to be more complex than a 360 type instruction
set. For this reason more time is spent accessing M1 and
executing microcode. This tends to make the duty cycle of the
processor higher than with a 360 type instruction set, which in
turn tends to reduce memory contention.

The processor's duty cycle and the parameter h are
directly related.

$$D = \text{duty cycle} = \frac{n_1 t_1}{n_1 t_1 + n_2 t_{2\mathrm{eff}}}$$

$$h = \frac{n_1}{n_1 + n_2}$$

$$D = \frac{h}{(1 - h)\,r + h} \quad \text{where} \quad r = \frac{t_{2\mathrm{eff}}}{t_1}$$
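
The duty cycle relation above can be checked numerically. The sketch assumes that n1 and n2 are the numbers of M1 and M2 accesses, that t1 and t2eff are the corresponding access times, and that r is the ratio t2eff / t1; these readings are inferred from the formulas, and the function names are illustrative.

```python
def duty_cycle(h: float, r: float) -> float:
    """D = h / ((1 - h) * r + h), with r taken as t2eff / t1."""
    return h / ((1.0 - h) * r + h)


def duty_cycle_from_counts(n1, t1, n2, t2eff):
    """The same quantity from access counts and times."""
    return (n1 * t1) / (n1 * t1 + n2 * t2eff)
```

For example, with h = 0.9 and an M2 access ten times slower than M1 (r = 10), both forms give D = 0.9 / 1.9, or about 0.47; raising h to 0.99 lifts D to about 0.91.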


4.3 Operating Memory and Memory Management

The concept of a memory hierarchy, discussed in relation
to M1 and M2, can be extended to the relationship between
M2 and M3. For large file oriented systems archival storage,
M4, is also considered.

4.3.1 Background 

Since program and data can only enter the computation
process via M2 one must control the flow of information across
the hierarchy of memory. This control is the job of memory
management.

Virtual memory is a technique for managing the utilization
of memory in processing systems where program space
exceeds the actual operating memory space. The concept has
evolved from the need to improve on early attempts to utilize limited
amounts of memory by overlaying. This required the user to
partition his program into pieces which fit into the available
space, and then plan the sequence of execution of the pieces and
control their reading into and out of operating memory. As
program requirements grew larger than a few thousand words, this
became a cumbersome task. To help the programmer, automatic
overlaying (folding) techniques, performed by the operating system
with compiler assistance, were developed. But eventually it became
clear that a system should allow a distinction to be made
between address space, a set of identifiers used by a program to
reference information, and memory space, the set of physical
operating memory locations [6].

Since a program could be allocated any physical M2 loca- 
tions, the addresses contained within the program string must be 
relative and not contain any absolute M2 reference. A translation
mechanism must map the relative addresses into absolute M2
addresses. Many machines employ a concatenation of the address
field of the instruction with the contents of some specified base 
register. Other schemes employ a descriptor mechanism, which 
is used to provide an indirect reference. In either case, the 
relative address is first presented to a memory map mechanism 
which determines if the desired element is in M2 or whether an 
M3 fetch is required. Figure 4.7 indicates the basic operations 
involved in memory management. 

The memory mapping mechanism usually employs a limited
associative memory to contain the most recently referenced
addresses. In the 360/67, the contents of the base register are
funnelled through an associative memory. In the Intermetrics
multiprocessor concept the descriptor's address field is
translated via an associative memory.

The first suggestion for achieving virtual memory was
published by Manchester University in England, in 1962 [7].
Virtual memory has subsequently been implemented in a number of
ways, most notably in systems designed to service a large user
body generating an unpredictable load and mix of processing
jobs (e.g., the HIS 6000 in the MIT Multics system, and the
IBM 360/67). The mapping mechanism requires address information
to be organized into blocks. Two basic schemes have been
defined for handling these blocks:

a) Segmentation organizes address space into a collection
of segments which are mapped into variable sized memory
blocks.

b) Paging organizes memory space into "pages" of fixed
size.

Figure 4.7: Memory Management

4.3.2 Segmentation


Since segmentation is concerned with the modularity and
structure of the program it is visible to and controllable by the
programmer, although usually indirectly through the use of a
language. He determines the size of the segment, and attaches
its name. Each segment may be considered as an independent
virtual memory. Internal to each segment, addressing is relative
to the beginning of the segment, and thus becomes independent
of addresses in any other segment. This property results
in what has been termed two-dimensional addressing: segment
number followed by location number. McKeeman [8] points
out that this addressing structure is employed in a number of
modern programming languages, such as ALGOL, PL/1 and FORTRAN.
It is also a property of HAL. These languages use a pair of
numbers to represent an address: the first number corresponds
to the nesting level (lexical level) of the occurrence of the
declaration of the name of the address, and the second indicates
the occurrence of the name within that level in the program.
The elements of a segmentation implementation mechanism
are shown below:


Segment Table 



Figure 4.8: Elements of Segmentation

Segments are located by reference to a table, each
entry of which is a segment descriptor defining the segment's
base address a and its size b. The position of the descriptor
in the table is S from the base of the table. A reference to
an address in name (address) space is of the form S,W. The
component S locates the desired segment's descriptor in the
table. If it is not in the table (i.e., the segment itself
is not in operating memory) a missing-segment trap occurs.
The segment is then brought into operating memory from mass
storage, and its descriptor is placed in the table. A test
whether W > b is made to check if a programmer's reference is
out of bounds of his own segments. Then the location a' in
physical memory to which the name space address S,W refers is
formed by a' = (a + W). This address translation mechanism
can be realized in special hardware, with a set of special
associatively addressed registers. Or the tables can be accommodated
in operating memory, with all translations performed by multiple
levels of indirect addressing. The latter approach involves
two or more memory accesses per reference and results in a
considerable penalty.
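
The translation of a name space address S,W can be sketched as follows. The dictionary used as the segment table, the exception classes, and the function name are illustrative assumptions; the missing-segment test, the W > b bounds test, and a' = a + W follow the description above.

```python
class MissingSegmentTrap(Exception):
    """Raised when the segment has no descriptor in the table."""


class BoundsViolation(Exception):
    """Raised when the word number exceeds the segment size."""


def translate_segment(segment_table, s, w):
    """Translate name-space address (S, W) to physical address a' = a + W.

    segment_table maps S to a descriptor (a, b): base address and size.
    """
    descriptor = segment_table.get(s)
    if descriptor is None:
        raise MissingSegmentTrap(s)   # segment must be fetched from mass storage
    a, b = descriptor
    if w > b:                         # the out-of-bounds test given in the text
        raise BoundsViolation((s, w))
    return a + w
```

A trap handler would bring the segment in from mass storage, place its descriptor in the table, and retry the reference.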


The segmented addressing scheme offers several attractions
for a large and diverse software system such as the space
station central multiprocessor.

a) Program modularity. Program modules are organized into
distinct, separately named and controlled segments.

b) Variable data structures. In a system such as the space
station, the data base will contain large and complex
data structures which will vary in size and content
during use. By creating segments of such structures
they may be assigned just the memory they require.
Their manipulation is well controlled.

c) Protection. A high degree of access control can be
provided by the segmented approach through indirect
addressing coupled with access privileges which constrain
read and write operations within a given segment.

d) Program sharing. By enabling one physically stored
module to be known in different address spaces under
different segment names, it may be directly shared
between two or more users. This obviates the usual practice
of creating copies of multiply used routines, and
consequently economizes on memory space.

4.3.3 Paging


Operating memory address space is divided into a number
of equal sized pages. Each page is identified by the memory
location of its first word. Words within the page are referenced
by word number w from the first word. A page is referenced by
its position p in the page table. A virtual memory address, a,
is equivalent to the pair p,w (in a similar fashion to segment
addresses). The total number of active pages may not exceed the
page capacity of operating memory. Those pages not being
executed are transferred to the next level of storage, thus realizing
the concept of virtual memory. Since all pages are equal in
size, replacement involves only the problem of finding the
necessary equal-sized "holes" in operating memory. "External"
fragmentation of memory need not occur.

Page availability is maintained in the page table. The
pth entry in the table is the memory location of the page
containing address a, where p = integer [a/Z], and Z = page size.
If the pth entry is missing, the page does not reside in memory,
and must be fetched. This condition is referred to as a missing
page trap. If the page is present, the referenced word is the wth
element of the page, where w = remainder (a/Z).
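
The page table lookup can be sketched in the same way. The dictionary used as the page table, the exception class, and the function names are illustrative assumptions; the split of a into p and w follows the two formulas above.

```python
class MissingPageTrap(Exception):
    """Raised when the page table has no entry for page p."""


def split_address(a, page_size):
    """Return (p, w) with p = integer[a/Z] and w = remainder(a/Z)."""
    return divmod(a, page_size)


def translate_page(page_table, a, page_size):
    """Map virtual address a to a physical location via the page table."""
    p, w = split_address(a, page_size)
    page_start = page_table.get(p)    # memory location of the page's first word
    if page_start is None:
        raise MissingPageTrap(p)      # page must be fetched from the next level
    return page_start + w
```

With Z = 256, virtual address 517 splits into p = 2 and w = 5; if page 2 currently starts at location 8192, the reference resolves to 8197.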

Paging is attractive to the system designer as a technique
for physical memory allocation, because of the regularity
of equal-sized pages. It is attractive to the programmer
because he is relieved of the concern of allocating physical
storage, and, indeed, need never exercise any direct control over
the mechanism.

A major design decision is the choice of page size. A 
large page, say over 1000 words, may result in a high proportion 
of unused page space, if natural program modules are smaller than 
the page size. This is referred to as "internal" fragmentation. 
With a small page, less than 10 words, an overhead problem arises 
due to the large number of pages that must be controlled. The 
best page size is determined by: 

a) Program locality

b) The speed ratio between memory hierarchy levels

Paging cannot achieve some of the advantages of segmentation
that were identified previously, because page boundaries
bear no natural relationship to program content. Segmentation,
on the other hand, lacks the advantages of a fixed size. It
requires the availability of contiguous regions of space, of
sufficient size to contain the segment. The problem of searching for
and/or creating variously sized "holes" in memory is a much more
difficult task than matching pages to page spaces.

It is natural to contemplate a combination of the two
mechanisms in order to realize both their advantages. The large
Multics system at MIT has been the only example of a heavily
developed segmented and paged memory management scheme.


4.3.4 Implementing Virtual Memory 

When implementing a virtual memory system a number of
properties are desired to minimize overhead.

a) An efficient memory map search. This is usually achieved
by employing a limited associative memory to hold
the most recently used page or segment descriptors.

b) An efficient M2 space allocation algorithm.

c) An efficient determination of the M3 address in the
case of a missing page or segment trap. The utilization
of a descriptor containing an M2 or M3 address,
depending upon the state of the presence bit, is
convenient.

d) One must attempt to minimize fragmentation of memory
into small unusable portions. A memory compaction
algorithm might be required.

e) One must minimize the possibility of overloading the
system to the extent that thrashing occurs. Thrashing
is a state which is reached when memory management
begins spending all its time moving pages or segments in
and out of M2 and overlaying pages or segments in use.
No time is left for processing applications programs.
Thrashing can be minimized by providing sufficient M2
and by keeping the unit of memory management small.

Figure 4.9 indicates Intermetrics' approach to memory
management via a descriptor-based, stack-oriented structure.
Absolute M2 addresses are only contained in "Mom" descriptors.
Only one "Mom" segment descriptor for a program or data segment
may exist. Many "Copy" descriptors may be created with a
pointer to the "Mom". This pointer is a two-dimensional address
specifying a stack number and offset (SNO). The SNO is the
relative address which must be translated into a physical M2
address. The 32 most recently referenced SNO addresses are
contained in the associative memory. The contents are updated
automatically whenever a reference is made. If the SNO
reference is found within the associative memory, the "Mom"
descriptor which contains the absolute M2 address is retrieved from
local storage (M1) and the operating memory address is obtained.

If the associative memory does not contain the referenced
SNO, then a three level indirect addressing sequence
through M2 is executed. The first level fetches the stack pointer
from within the stack vector using the stack number as the
relative address from the base of the stack vector. The second
level of indirection is used to fetch the "Mom" descriptor
using the offset part of the address as the positioning
relative to the base of the stack.
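
The SNO lookup path can be modelled as a small software cache in front of the indirect sequence. This is a sketch under stated assumptions: the class and attribute names are hypothetical, M2 is modelled as a dictionary, least-recently-used replacement stands in for "the 32 most recently referenced", and only the two indirection levels that the text spells out are shown.

```python
from collections import OrderedDict

ASSOCIATIVE_ENTRIES = 32   # the text retains the 32 most recent SNO addresses


class AddressMap:
    def __init__(self, stack_vector, m2):
        self.stack_vector = stack_vector   # stack number -> base of that stack in M2
        self.m2 = m2                       # M2 modelled as a dict: address -> contents
        self.associative = OrderedDict()   # (stack number, offset) -> "Mom" descriptor

    def mom_descriptor(self, stack_number, offset):
        sno = (stack_number, offset)
        if sno in self.associative:
            self.associative.move_to_end(sno)          # mark as most recently used
            return self.associative[sno]
        # Miss: indirect through M2.
        stack_base = self.stack_vector[stack_number]   # level 1: stack pointer
        descriptor = self.m2[stack_base + offset]      # level 2: "Mom" descriptor
        self.associative[sno] = descriptor
        if len(self.associative) > ASSOCIATIVE_ENTRIES:
            self.associative.popitem(last=False)       # drop least recently used
        return descriptor
```

A second reference to the same SNO is served from the associative store without any M2 access, which is the saving the associative memory is meant to provide.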

If the referenced segment is not in M2 it must be
fetched from M3. This is indicated by a "presence" bit
contained in the "Mom" descriptor. If the segment is present
within M2 it is referenced directly. In either case the
associative memory is updated so the "Mom" descriptor can be
referenced more directly the next time.

Chapter 5
ADDRESSING


The question of addressing is the most dominant feature
in the diversity of instruction architectures. It can be viewed
from many different angles: correspondence to software, ease of
usage for programmers, bit minimization, physical implementation
and execution, hierarchies of memory, and/or operating system
memory resource allocation. We shall discuss several of these
aspects and show various options or methods that may be employed.


5.1 Addressing and Instruction Architecture

When an instruction architecture is contemplated several
different independent decisions with regard to addressing
within an instruction must be reached. The number of operands
which an instruction can contain may vary from three, two, or
one explicit operand(s) to implied operands, where the implied
operands are to be obtained from a stack. The question as to
how many hardware registers, of what type, and how they are to
be addressed arises (single accumulator or "general" register,
hardware "top of stack" for a depth of two, ...). Finally,
exactly how is memory to be addressed: all memory addressable,
two-dimensional addressing, self-relative, etc.


5.1.1 The Number of Operands in an Instruction 

Most operations which occur in algebraic languages are 
dyadic operators. That is, the operation manipulates two inputs, 
transforming them into a new output value. It is seen that dya- 
dic operators (+, t , x, ... ) have three operands: two input 

operands and one output operand. There are, of course, monadic 
operations such as negate or absolute which have two operands: 
one input operand and one output operand. 

Instruction architectures vary as to the number of ex- 
plicit memory-addressed operands which appear within the instruc- 
tion, yet, of course, the necessary three operands for dyadic 
operators must be present. (Two operands for monadic operators.) 

Three memory operand instructions are found in several
machines including the Honeywell 800/1800 series. However,
when the actual usage of dyadic operators is examined it is seen
that seldom are three different memory addresses needed.
Consider for example:

A = B;

A = A + 1;

A = -B/C;

In these, admittedly biased, examples the use of the three
memory address operands is wasteful. In the first example, there
is but one input and one output. In the second, one of the
inputs is also the output address, and in the third a monadic
operator appears.

The waste, or non-use, of a memory address is only bad 
in so far as it takes room. If the instructions are of the 
three-operand form and not all three operand memory addresses 
are used, the instruction still must save space for the presence 
of these memory addresses which are many bits in length. It is, 
therefore, usually found advantageous to have at least one of 
the three operand addresses implied. 

Two memory operands are occasionally met with in the 
instruction architecture. In this case one of the two operands 
besides being an input is usually also the output operand. The 
IBM 1401 is such an example. This form of two operands can be 
very useful where most of the operators are monadic such as is 
commonly found in data processing where much of the computer 
time is spent in moving data and editing them. 

The most common architecture found is based upon single 
memory address operand instructions. This is common in both the 
second and current third generation computers such as the IBM 
7090, IBM 360 series, Univac 1108, and the DEC PDP-10. With the 
single memory address operand an accumulator (or another "regis- 
ter") becomes an implied operand for the instruction. Commonly 
then the implied operand serves both as one input operand and the 
output operand of a dyadic operator. When monadic operators are 
used one operand can be the memory address and the other the im- 
plied accumulator. When the third generation of computers de- 
veloped, the "implied" accumulator was often made into a set of 
general registers of which one could be selected to be the ac- 
cumulator. This has led to the characterization of the 360 as 
having a 1.5 operand instruction set. 

The single memory address form of instruction is very
useful when sequential accumulation of results occurs, such as
in:

A = B + C + D + E;

However, if a tree structure form of computation is needed (as
commonly occurs) such as:

A = (B + C) * (D + E);

the accumulator would have to be saved after calculation of
B + C before D + E can be done. One of the hopes of the general
registers development in the third generation computers with
multiple accumulators was to be better able to do efficient
calculations of this form (i.e., save on storage to memory for
temporaries).
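
The difference between the two statements can be seen on a toy single-accumulator machine. The opcode names, the interpreter, and the temporary T1 are illustrative assumptions and do not correspond to any particular instruction set.

```python
def run_accumulator_program(program, memory):
    """Execute (opcode, address) pairs on a one-accumulator machine."""
    acc = 0
    for opcode, address in program:
        if opcode == "LOAD":
            acc = memory[address]
        elif opcode == "ADD":
            acc += memory[address]
        elif opcode == "MUL":
            acc *= memory[address]
        elif opcode == "STORE":
            memory[address] = acc
        else:
            raise ValueError(opcode)
    return memory


# A = B + C + D + E: results accumulate, so no temporary is needed.
CHAIN = [("LOAD", "B"), ("ADD", "C"), ("ADD", "D"), ("ADD", "E"), ("STORE", "A")]

# A = (B + C) * (D + E): B + C must be saved in temporary T1 first.
TREE = [("LOAD", "B"), ("ADD", "C"), ("STORE", "T1"),
        ("LOAD", "D"), ("ADD", "E"), ("MUL", "T1"), ("STORE", "A")]
```

The tree-structured statement costs one extra store and one extra memory operand reference for T1, which is the overhead a second accumulator would remove.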

One of the principal advantages of having fewer memory
operands with each instruction is in the space savings to
be found by not having useless fields in all instructions.
That is, it would be desirable to use instruction space for
memory address operands only when they are needed. The
ultimate in this form of space savings is to be found in the zero
memory address operand instruction. In this case all of the
necessary operands for an operator are implied. These are the
stack machines where the "top of the stack" provides the
necessary number of operands for an operator and the resultant
output value is in turn placed upon the stack. The Burroughs
B5500 and B6700 are examples of such machines. The memory
address operands, of course, must be able to be fetched from
memory and stored into memory. These are, in effect, merely two
forms of operands.

This stack form of instruction is one of the most
efficient ways in which to specify an algorithm since only the
minimum amount of information needed for execution need be
present.
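
A zero memory address instruction stream can be illustrated with a toy stack interpreter. The opcode names are hypothetical and are not the B5500 or B6700 instruction set; the point is only that the operators carry no address fields, while the fetch and store instructions do.

```python
def run_stack_program(program, memory):
    """Execute a zero-address program; only PUSH and POP name a memory address."""
    stack = []
    for instruction in program:
        opcode = instruction[0]
        if opcode == "PUSH":
            stack.append(memory[instruction[1]])   # fetch operand from memory
        elif opcode == "POP":
            memory[instruction[1]] = stack.pop()   # store result into memory
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(opcode)
    return memory


# A = (B + C) * (D + E) in postfix order: B C + D E + * then store to A.
PROGRAM = [("PUSH", "B"), ("PUSH", "C"), ("ADD",),
           ("PUSH", "D"), ("PUSH", "E"), ("ADD",),
           ("MUL",), ("POP", "A")]
```

The intermediate sum B + C simply stays on the stack while D + E is formed, so no named temporary is required.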

The stack itself can be considered in several ways.
From a HOL point of view the implied operands of the stack
correspond to many of the parse algorithms which have been developed
for compilation and hence are able to produce extremely
efficient code. From a multiple register point of view the stack
provides a method for the dynamic assignment of the general
registers rather than the static assignment at compilation time with
its inherent inefficiencies.


5.1.2 Single Accumulator and General Registers 

While many second generation computers had a single
accumulator, third generation machines have tended to have a
set of general registers. This has come about for several
different reasons. Each reason stems from the basic desire for
more efficient and quicker execution. As was seen above, a
single accumulator does not make for efficient execution of
tree structured statements. Therefore, if several accumulators
were available, storing into memory for a temporary could be
avoided; this would save both time and space since memory would
not have to be referenced. Also technology, by the third
generation, had improved to the degree of allowing more complex
hardware in the processor. Thus, multiple accumulators could
be implemented.

Another aspect is involved with the addressing of memory.
Second generation machines often had index registers
separate from the accumulator; these then needed a separate set
of instructions for their manipulation and similarly they were
then restricted in the operations which could be performed on
them (e.g., no multiplications with an index register). The
third generation often has truly general registers which can be
either accumulators or index registers (or base registers) thus
optimizing the resources of the speedy registers for use as
needed.


The desire to use more accumulators was based on the
desire to improve the speed of computation by having fewer
memory references and by doing manipulations and operations with the
general register set. Unfortunately, this very desire forces
the introduction of bookkeeping instructions to set up the
registers so that they can be manipulated. It is often difficult
to tell from instruction occurrence statistics for an IBM 360
whether the large number of loads (L, LH, IC) used are to keep
the register policy happy or are rather a by-product of
improvement.


When both base registers and index registers are avail- 
able their usage is often confused. Base registers are primarily 
used to address physical locations. They provide the capability 
of addressing particular regions of core. Their value interpre- 
tation is that of a physical memory address. Index registers 
are used to locate an element within an ordered data structure. 
They refer to data elements which are to be manipulated and do 
not inherently indicate physical addressing. If a character
array is being indexed, then the elements are in byte units; if
word integers are being referred to, the index actually refers
to four byte quantities (in the 360). Because this distinction
is not maintained the automatic quality of element indexing
cannot in general be performed. (In the 360 the SLL instruction
proliferates in order to align the "index" properly.)
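
The index alignment problem amounts to scaling an element index by the element size before it is added to the base. The sketch below is illustrative only; the function names are assumptions, and the left shift by two stands in for the role the text attributes to SLL when fullwords are indexed.

```python
WORD_BYTES = 4   # a 360 fullword occupies four bytes


def element_address(base, index, element_bytes):
    """Byte address of element `index` in an array starting at `base`."""
    return base + index * element_bytes


def word_element_address_with_shift(base, index):
    """The same for fullwords, scaling the index by a left shift of 2."""
    return base + (index << 2)
```

A machine that knew the element size from the data description could apply this scaling automatically; without that knowledge the compiler must emit the shift explicitly.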

One other major problem can develop with the use of a
set of general registers. This is the question of how to
optimally use them. A choice has to be made as to which registers
are to be used for accumulator(s), base register(s) or index
registers. The static assignment of the use of the registers
unfortunately does not often correspond to the optimum dynamic
usage. This is the case since the flow of control through
an executing program is simply not known. Alternate paths
of execution exist since this is what the execution of an
algorithm is about. The use and savings of registers for one branch
of an IF...THEN...ELSE statement is, in general, entirely
different from that of the other branch. Similarly when a
subroutine is entered, the use of registers within the subroutine
is not correlated with use in the calling procedure (whose
CALL may be issued at distinct locations each with different
register usage).

While it could be truly argued that in any case multiple
registers are better than one, the actual policy implementation
[remainder of sentence illegible in source].




Above, it was seen that in the zero memory address
operand form of instructions a push down stack mechanism is used
for operands. This, by its very nature, tends to optimize the
usage of the hardware registers available for accumulators and
index registers. When a subroutine is entered, the dynamic
environment stack continues to push and pop as needed for the
subroutine and hence it acts as an automatic dynamic optimization.


When optimizing is tried in code generation by trying
to identify common sub-expressions (e.g., I + 1 in
A[I+1] = B[I+1]) the stack can become inefficient.
(The code needed both in time and space to save and restore
I + 1 is (can be) more than the actual recalculation of I + 1.)


5.1.3 How to Address Operating Memory 

Various methods of addressing physical memory are found
in instruction architectures. All of memory may be addressed,
a bank of memory may be addressed, addresses may be relative to
the executing instruction or, while only a small portion of
memory may be directly addressed, the rest could be addressed
"indirectly".

Machines such as the IBM 7094 addressed all of memory.
This form of addressing implies that the memory address operand
must have the number of bits needed to represent all of memory.
Not only is this wasteful, since usually only a small portion
of memory is needed in the execution of a program, but it also
limits the size of memory which can then be used with the
instructions.

In order to both reduce the size of the memory address
operand field and to remove the restriction on memory size
(or at least increase the limit beyond foreseeable needs), two
dimensional addressing is introduced. This addressing can be
in fixed banks where a certain block of memory is in use (in
the Apollo Guidance Computer there were four "banks" addressable
at any given time: fixed erasable, fixed fixed, banked
erasable and banked fixed) and hence memory address operands
then refer to addresses within the current block, or a more
dynamic form of banking can occur as in the 360 where a base
register points to a starting location and a displacement field
then refers to an offset from the base.

Thus with the use of 16 bits, 4 bits to indicate base
register and 12 bits of displacement, the IBM 360 is able to
address up to 24 bits (16 megabytes) of memory. The penalty, of
course, is the overhead which must be paid in the setting,
using, and maintaining of a base register and the restriction to
a maximum displacement of 4K bytes in a program segment without
the setting of another base register (or the resetting of the
current base register).
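
The base-displacement arithmetic can be written out as a short sketch. It is a simplification under stated assumptions: the function and constant names are illustrative, the optional index register is left out, and the 360 convention that a base field of zero means "no base register" is not modelled.

```python
ADDRESS_MASK = (1 << 24) - 1       # 24-bit addresses span 16 megabytes
MAX_DISPLACEMENT = (1 << 12) - 1   # a 12-bit displacement reaches 4K bytes


def effective_address(registers, base_field, displacement):
    """Form an address from a 4-bit base register number and 12-bit displacement."""
    if not 0 <= base_field < 16:
        raise ValueError("base register field is 4 bits")
    if not 0 <= displacement <= MAX_DISPLACEMENT:
        raise ValueError("displacement field is 12 bits")
    return (registers[base_field] + displacement) & ADDRESS_MASK
```

Any reference more than 4095 bytes beyond the base cannot be expressed in the displacement field, which is why a second base register must be set up (or the current one reset) for larger program segments.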

Another form of two dimensional addressing appears in 
those computers which have been designed for the execution of Al- 
gol (e.g., B6700) . Since the instructions to be executed are 
reflective of Algol, the data referred to must reflect the name 
scope restrictions of Algol. The B6700 makes effective use of 
the name scope restrictions in Algol to have its "base" regis- 
ters (i.e., Burroughs Display registers) set automatically to 
the dynamic environment of the addressable data. The B6700 "base" 
register points to each succeeding lexical level which is ad- 
dressable within name scope rules. The displacement then refers 
to a particular entity within the lexical level. 
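
Addressing by lexical level and displacement can be modelled with a small display of base pointers. This is a hypothetical software model for illustration, not a description of the Burroughs hardware: the class, its method names, and the rule that entering a level discards deeper levels are all assumptions.

```python
class Display:
    """One base pointer per lexical level currently addressable."""

    def __init__(self):
        self.registers = []   # registers[ll] = start address of lexical level ll

    def enter(self, lexical_level, frame_start):
        """Enter a block or procedure declared at `lexical_level`."""
        if lexical_level > len(self.registers):
            raise ValueError("enclosing lexical level is not active")
        del self.registers[lexical_level:]   # deeper levels are out of scope
        self.registers.append(frame_start)

    def address(self, lexical_level, displacement):
        """Resolve the two-dimensional address (lexical level, displacement)."""
        return self.registers[lexical_level] + displacement
```

Because the pointers are reset on every entry, the program text only ever contains the pair (lexical level, displacement) and never an absolute address.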

Besides having base registers, as in the 360, which are
able to address any region of core, many architectures allow
"indirect" addressing. By referring to an address word which is
within the area which you can address, you are allowed to
"indirect" your reference through this address word to what it points to.
Thus, while only a small portion of memory may be "directly"
addressable, all of memory becomes addressable.

It is apparent that when the 360 was designed, the
increase to 16 general registers from one accumulator and a few
index registers seemed so magnificent that the need for indirect
references was deemed not necessary. (The 4 Pi AP-1, which is a
flight computer by IBM modified from the 360 instruction set, has
restored indirection.) It turns out that the use of a few
indirect references could save immense overhead on register usage
and allocation.

When data is being addressed, the actual number of
entities (variable "names", e.g., A, B, ... in a program) involved,
in general, is small. This comes from the simple limitation of
the human programmer. The amount of storage, however, may be
large (e.g., arrays of data, single or multiple dimension).
When an element in an array is referred to, an index is used.
This phenomenon has the very nice property of making the
base-displacement form of addressing attractive. While entries can
be directly addressed, arrays can be indexed into. The number
of different data areas is also generally limited, again due to
programming language restrictions and conventions and hence the
number of different data regions is in general small and
therefore the number of base registers for data addressing is in
general not too large.


Instructions have other characteristics. Often a routine
will far exceed the 4K byte displacement allowable with IBM 360
addressing from one base register. Addressing of a code segment
within a code segment is concerned with control flow and usually
has a very local nature. This brings one final form of
addressing: self-relative addressing. Often branches occur to
simply skip one instruction, or a few as in an IF...THEN...ELSE.


By using self-relative addressing for control flow within an
instruction stream a very high degree of size compaction can
occur; the code becomes automatically relocatable without any
change, and the restrictions (e.g., 4K bytes per base register)
on the code segment length can be removed.


5.2 The IBM 360 and Burroughs B6700

In order to gain an appreciation of the difference in 
addressing structures, a comparison between the IBM 360 and the 
B6700 is given. 


5.2.1 Two Dimensional Addressing (Static and Dynamic) 

In order to process large computational jobs a large
amount of addressable space is needed, but with a second
generation machine such as the 7090 all of this space (and hence
the limit of the memory size) must be addressable. In this
case it was necessary to use 15 bits in every operand address.
The IBM 360 and B6700 both have two dimensional addressing.
The IBM 360 uses a 12 bit displacement which is added to one of
15 base registers. This allows for a full 24 bit addressing (of
bytes) scheme. Here 24 bits of address space have been
compressed into 16 bits of information. The B6700 scheme uses
only 14 bits with its operands, where the base (DISPLAY)
register is defined with only the number of bits needed to
indicate the current lexical level (ll) (i.e., ll=1 implies 13
bit displacement, ll=2 implies 12 bit displacement), and the
B6700 displacements refer to "words". Since program segments in
the B6700 are described via a "descriptor", the actual size of
memory which could be addressed is only limited by the number of
bits so used in the descriptor. In point of fact, Burroughs uses
20 bit word addresses in their descriptors.

It is easy to see then that if the memory of a computing
system is large compared to the modular size of "programs"
(or perhaps even procedures and routines), program string
savings are to be found by using a two dimensional address.
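
The bit counts quoted above can be checked with a short sketch.
It is only an illustration under stated assumptions: the function
names are hypothetical, and the rule applied to lexical levels
other than ll=1 and ll=2 (a level field just wide enough to name
levels 0 through ll) is extrapolated from the two cases given in
the text.

```python
def ibm360_operand_bits(base_bits: int = 4, displacement_bits: int = 12) -> int:
    """Bits spent per operand address: base register number plus displacement."""
    return base_bits + displacement_bits


def b6700_couple_split(ll: int, couple_bits: int = 14) -> tuple[int, int]:
    """Return (level_bits, displacement_bits) for the current lexical level ll.

    Assumption (not stated in full by the report): the level field is just
    wide enough to name any level from 0 to ll.
    """
    if ll < 0:
        raise ValueError("lexical level must be non-negative")
    level_bits = max(1, ll.bit_length())
    return level_bits, couple_bits - level_bits


if __name__ == "__main__":
    # 24 bits of byte address are expressed with 16 bits in the instruction.
    print("IBM 360 operand bits:", ibm360_operand_bits())
    for level in (1, 2, 3, 4):
        print("ll =", level, "->", b6700_couple_split(level))
```

Under these assumptions the displacement field shrinks by one bit
each time the lexical level count doubles, which is the behavior
the text describes for ll=1 and ll=2.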

There is a great difference, however, between the IBM
360's and B6700's two dimensional addressing schemes. The IBM
360 base registers are assigned "statically" at compile time, and
it is up to the compiler to try to optimize base register usage.
This optimization is minimal if only one base register is needed
within a segment. It becomes difficult in large segments since
the dynamic characteristics of the segment modularization must
be considered.

This static two dimensional addressing of the IBM 360 
has several aspects. 

a) By using 4 bits everywhere for base registers the dis- 
placement range is reduced, since seldom are that many 
registers desirable. 

b) If a program is "one big" segment, then several base 
registers are needed and segment boundaries must be 
carefully watched. 

c) If the base registers are set upon entering and upon
returning to each module then:

1) There must be code to do this in the program
strings.

2) Name scope problems arise when variables in a
previous level are to be addressed since their
base registers are in general no longer in
existence.

The B6700 optimizes upon the two dimensional address idea by:

a) Using only the number of bits necessary for the current
lexical level to indicate the number of bits for the
"base register". This leaves the rest of the bits for
displacement. (There is also the fortuitous circumstance
experienced by all, that the more "inner" a subroutine
the "smaller" it is, i.e., it needs less displacement to
fully address it.)

b) The base registers point at the beginning of each
dynamic module, hence allowing the displacement to reach
its most extreme logical dynamic range.

c) Since the usage of the "base" (display) register is
unique and well defined (versus general, e.g., base
register, an accumulator or an index register), the
initialization and resetting of them can be accomplished
automatically. Furthermore, no explicit code in the
program string is required and current dynamic name
scope is maintained.


5.2.2 Implicit Addressing 

Compare the expression:

     A = B + C;

on the B6700 versus the IBM 360:

     B6700          IBM 360

     VALC C         L   R0, C
     VALC B         A   R0, B
     ADD            ST  R0, A
     NAMC A
     STOD

In each case they execute similarly: (fetch C), (add B to this
value) and (store value into A). In effect it is the only
sequential form possible (i.e., ADD before STORE) for this
expression.




However, when temporary locations become necessary a
difference appears in the code, although the total effect must,
of course, remain the same. Consider A = (B + C) * (D + E):

     B6700          IBM 360

     VALC C         L   R0, C
     VALC B         A   R0, B
     ADD            ST  R0, TEMP
     VALC E         L   R0, E
     VALC D         A   R0, D
     ADD            M   R0, TEMP
     MULT           ST  R0, A
     NAMC A
     STOD

Assuming that there are only a few (in our case exactly one) ac- 
cumulators being used, during the expression evaluation it becomes 
necessary to create a temporary. 

The creation of a temporary indicates an increase in the 
program size for two reasons. 

a) In general, the use of temporaries is a static decision
and hence cannot behave better than the dynamic usage
of the stack. Therefore, one needs more "temporary
storage" locations than stack storage.

b) But more importantly, in the IBM 360 type of machine,
every instruction has an operand; therefore, the
temporary requires an address which in turn takes space.
The B6700 uses implicit addressing, the needed number
of operands coming from the appropriate number of
locations on top of the stack.

When temporaries are needed, an implicit address scheme most
often saves the "temporary" operand addresses.
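
The two code sequences for A = (B + C) * (D + E) can be traced
with a small interpreter sketch. This is a simplified model, not
the B6700 or IBM 360 instruction semantics: NAMC is assumed to
push a name, STOD is assumed to store the value lying beneath
that name, the register field R0 is dropped from the accumulator
form, and all function and constant names are hypothetical.

```python
def run_stack_program(program, memory):
    """Interpret a tiny subset of B6700-style operators on an operand stack."""
    stack = []
    for op, *arg in program:
        if op == "VALC":            # push the value of a named variable
            stack.append(memory[arg[0]])
        elif op == "NAMC":          # push a reference (modelled as the name)
            stack.append(arg[0])
        elif op == "ADD":           # operands come implicitly from the stack
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MULT":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "STOD":          # name on top, value beneath it
            name, value = stack.pop(), stack.pop()
            memory[name] = value
        else:
            raise ValueError(f"unknown operator {op}")
    return memory


def run_accumulator_program(program, memory):
    """Interpret a one-accumulator program; every instruction names an operand."""
    acc = 0
    for op, operand in program:
        if op == "L":
            acc = memory[operand]
        elif op == "A":
            acc += memory[operand]
        elif op == "M":
            acc *= memory[operand]
        elif op == "ST":
            memory[operand] = acc
        else:
            raise ValueError(f"unknown operator {op}")
    return memory


def operand_count(program):
    """Number of instructions that carry an explicit operand address."""
    return sum(1 for instruction in program if len(instruction) > 1)


STACK_PROGRAM = [("VALC", "C"), ("VALC", "B"), ("ADD",),
                 ("VALC", "E"), ("VALC", "D"), ("ADD",),
                 ("MULT",), ("NAMC", "A"), ("STOD",)]

ACCUMULATOR_PROGRAM = [("L", "C"), ("A", "B"), ("ST", "TEMP"),
                       ("L", "E"), ("A", "D"), ("M", "TEMP"), ("ST", "A")]

if __name__ == "__main__":
    data = {"B": 2, "C": 3, "D": 4, "E": 5}
    print(run_stack_program(STACK_PROGRAM, dict(data))["A"])
    print(run_accumulator_program(ACCUMULATOR_PROGRAM, dict(data))["A"])
    print(operand_count(STACK_PROGRAM), operand_count(ACCUMULATOR_PROGRAM))
```

In this model the stack form has more operators but fewer
explicit operand addresses, and it never names TEMP; the
intermediate sum lives only on the stack.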


5.2.3 Descriptors 

Descriptors can be considered either as sub-operators 
or as the ideal data structure which is being manipulated. When 
considered in the first manner, it is seen that the descriptor 
saves on the program string length. "Fewer" operators need be 
specified since the "sub" part of the operator is found in the 
descriptor of the data structure. For example, the IBM 360 has
for "add":

     AH, A(R), AL, AE(R), AD(R), AU(R), AW(R), AP

while the B6700 has simply "ADD". This of course requires fewer
opcodes, and in turn fewer bit states for the necessary
operators.


When the descriptor is regarded as the "data structure",
it shows at least two virtues. One is the fact that by being
"semantically concise" (further discussed below) it places into
one location the complicated description of the data structure,
which thereby need not be repeated in multiple references in
the program. The other is the observation that the number of
entities which are manipulated by a program is small. The reason
that a large addressing space is normally necessary is that if
the machine does not have descriptors, then each "memory cell" of
the data structure must be directly addressable. For example,
an "array" of 100 scalars on the IBM 360 is in fact 100 memory
locations. On the B6700 it is one entity: a descriptor which
indicates the dimension of 100 and where the array is to be found
in physical core. This very important phenomenon reduces the
addressing requirements of a program string, since the full
physical memory address need only appear in the descriptor. The
descriptor becomes one of the "few" entities which must be
addressed and hence only a small address field is needed in the
program string proper.


5.2.4 Type Differences 

Descriptors allow any information which can be
"bottlenecked" to be placed in the descriptor once, instead of
having the information repeated throughout the program string.

Besides having character data (for I/O) and an inter- 
nal arithmetic form, most machines have in fact several internal 
forms. The difference between the "character" and "internal 
arithmetic" comes largely from the savings yielded by compactly 
storing and manipulating them in the internal form. The various 
internal forms come from considerations of preciseness. 

Types can be optimized by:

a) making one a proper subset of another (e.g., integer is
a subset of single precision floating point on the B6700).
Thus, the difference between the operators disappears
(except for an explicit operator to recover the proper
subset, such as INTEGERIZE),

b) the need for multiple forms of the same operator
disappears (e.g., IC, LH, L, LD, LE),

c) and the need for explicit type conversion operations is
reduced. The program string could be further minimized
by providing an explicit operator for each type conversion
when needed (e.g., scalar to character, while integer to
scalar would be implicit by the integer definition as a
subset of scalar).


5.2.5 Semantic Conciseness 

Probably the most powerful way to save in program string 
length is by having semantically compact operators. By having 
the operator correspond to the operations indicated in the pro- 
blem language being executed, the minimum amount of translation 
is needed and hence the minimum amount of expansion in the pro- 
gram string. 

The Burroughs B6700 is an "Algol" machine. Its opera- 
tors are those that ALGOL indicates. 

The IBM 360 is semantically concise only to "BAL" which 
is merely stating a tautology. The IBM 360 is not semantically 
concise to any real "problem oriented language". 

Besides being semantically concise with respect to the 
operations needed for a problem the operators can be "semanti- 
cally concise" in the way in which they are constructed. Branch- 
ing occurs within a program under execution and not logically 
with respect to all of physical memory. The IBM 360, as most 
machines, allows the branch address to be any address of physi- 
cal memory. The B6700 uses relative addressing (that is, re- 
lative to the program under execution) either in the same or dif- 
ferent segment. This of course reduces the address space neces- 
sary, since it corresponds to the dynamic space involved at ex- 
ecution time. The RC4000, although built upon similar concepts 
as the IBM 360, has relative addressing, and this in turn creates 
an efficient and small (er) addressing need. 

In the IBM 360 each memory reference instruction gen- 
erally carries 4 bits of indexing information. The B6700 in- 
dexes only when needed, and since a stack is used (hence impli- 
cit addressing) only an 8 bit operator is needed (which can 
also load the resultant indicated entity) . Assuming that not 
every memory reference needs to be indexed (the indices them- 
selves must be fetched from memory) the use of indices when 
needed, (and semantically concise operations make the need less) 
will, in most every case, minimize the program string length. 

The use of short literals also compresses the program
string since the constants used are usually small integral
values. Recognition of this fact allows for their representation
in the amount of space needed and not the amount for the
worst (largest) case possible.

INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840


5.3 Implementation Aspects of a Stack Machine


5.3.1 Definitions 


The stack provides the mechanism through which implicit
addressing can be accomplished in a semantically concise
and efficient manner. The control sequencing and addressing
with the stack will be discussed in this section. A specific
implementation is presented. Details can vary from machine to
machine. However, the fundamental ideas will remain the same.


In a sense the stack is a hardware element just as the 
arithmetic unit is an element. It can execute three primitive 
commands : 


a) PUSH 

The PUSH command will take the contents of the stack 
buffer register and place it on top of the stack. Sim- 
ultaneously it will shift all other elements of the 
stack down one level. For example, the old top of the 
stack becomes the second entry in the stack. 

b) POP 

The POP command fetches the top of the stack and places 
it in the stack buffer register. Simultaneously, it 
shifts the contents of all other elements of the stack 
up one level. For example, the old second entry of the 
stack becomes the new top of the stack. 

c) Stack Fetch

PUSH and POP store or retrieve information from the 
top of the stack. In many instances, information is 
desired from other stack locations. The Stack Fetch 
sequence accomplishes this function by fetching from 
the stack location (indicated by the lexical level 
and displacement) and placing the information in the 
stack buffer register. Stack Fetch does not change 
the state of the stack in any way. 

One could implement the stack as a word parallel-shift
register. This would fix the length and make it a specialized
element of the computer. In order to achieve generality and
flexibility in the design, we choose to implement the stack by
employing a standard linear memory array with some specialized
pointers. These elements are manipulated by microcode to
create the three control sequences.

In general, the length of the stack can vary during
execution from tens of words to thousands of words. For this
reason the bulk of the stack must, as a practical matter, be
contained in M2. However, the more dynamic part of the stack (we
choose 8 locations) can be placed in M1 for faster access. For
the purpose of the following description the stack word size and
M2 word size are assumed to be the same.


5.3.2 PUSH 

The PUSH sequence, whose flow chart appears in Figure
5.1, involves both the M1 and M2 portions of the stack. Figure
5.2 depicts these two portions and provides definitions of the
various pointers used to control the stack.

The M1 portion of the stack can be pictured as a
wrap-around shift register. The oldest data is pointed to by
M1SL (M1 Stack Limit). The first empty location is pointed to
by M1TOS (M1 Top of Stack). Whenever M1TOS = M1SL, namely when
the M1 portion of the stack overflows, the contents of (M1SL) are
moved into the M2 location indicated by M2TOS (M2 Top of Stack).
If M2TOS ever equals M2SL (M2 Stack Limit) then the M2 part of
the stack has overflowed and a trap is generated. The stack
overflow trap routine could then, depending upon conditions,
allocate more storage for stack use and change M2SL.

The data to be entered into the top of stack is
contained in M1BR (M1 Buffer Register). Upon entrance to the
routine M1TOS is compared with M1SL to see if the M1 portion of
the stack has overflowed. If it has not, an M1 write is set up.
The M1 address is M1TOS and the data is contained in M1BR.
Finally, M1TOS is incremented, modulo 8, before the exit.

If the M1 stack overflows a determination is made as
to whether the M2 part of the stack will overflow. If so, a
trap is entered. If not, an M2 write is set up, using the M2
address indicated by (M2TOS) and the data pointed to by (M1SL).
M1SL and M2TOS are incremented, followed by the M1 write set up.


5.3.3 POP 

The POP sequence is shown in Figure 5.3. If the M1 part
of the stack is empty, an M1 stack underflow condition exists
and a read from M2 must be initiated with an M2 address of
(M2TOS) - 1. The M2 word is placed into M1BR. On the other
hand, if the M1 stack is not empty, the contents of (M1TOS) - 1
are read and placed into M1BR.

If the stack becomes empty the condition is set into
SRI for input into the next POP sequence. Every PUSH sequence
will reset this empty condition.

Figure 5.2: The Stack
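
The PUSH and POP sequences can be sketched in software as
follows. This is a minimal model under stated assumptions, not
the microcode of the design: the class and exception names are
hypothetical, an explicit `m1_empty` flag stands in for the empty
condition that the text says is held for the next POP (so that
M1TOS = M1SL can be told apart as "full" or "empty"), POP is
assumed to decrement M2TOS after reading from M2, and a read with
both portions empty is reported as an error.

```python
class StackOverflowTrap(Exception):
    """Raised when the M2 portion of the stack reaches its limit (M2SL)."""


class TwoLevelStack:
    M1_SIZE = 8  # the text places the 8 most dynamic locations in M1

    def __init__(self, m2_limit):
        self.m1 = [None] * self.M1_SIZE
        self.m1sl = 0          # M1SL: oldest entry held in M1
        self.m1tos = 0         # M1TOS: first empty M1 location
        self.m1_empty = True   # assumed stand-in for the "empty" condition
        self.m2 = [None] * m2_limit
        self.m2tos = 0         # M2TOS: first empty M2 location
        self.m2sl = m2_limit   # M2SL: limit of the M2 stack area

    def push(self, m1br):
        """Place the buffer register contents (m1br) on top of the stack."""
        m1_full = self.m1tos == self.m1sl and not self.m1_empty
        if m1_full:
            if self.m2tos == self.m2sl:
                raise StackOverflowTrap("M2 portion of the stack is full")
            # Spill the oldest M1 entry into M2, then advance both pointers.
            self.m2[self.m2tos] = self.m1[self.m1sl]
            self.m2tos += 1
            self.m1sl = (self.m1sl + 1) % self.M1_SIZE
        self.m1[self.m1tos] = m1br
        self.m1tos = (self.m1tos + 1) % self.M1_SIZE
        self.m1_empty = False  # every PUSH resets the empty condition

    def pop(self):
        """Return the top of the stack (the value loaded into M1BR)."""
        if self.m1_empty:
            if self.m2tos == 0:
                raise IndexError("stack is empty")
            self.m2tos -= 1            # read from address (M2TOS) - 1
            return self.m2[self.m2tos]
        self.m1tos = (self.m1tos - 1) % self.M1_SIZE
        value = self.m1[self.m1tos]
        if self.m1tos == self.m1sl:
            self.m1_empty = True       # remembered for the next POP
        return value


if __name__ == "__main__":
    stack = TwoLevelStack(m2_limit=16)
    for word in range(12):
        stack.push(word)
    print([stack.pop() for _ in range(12)])
```

Pushing twelve words spills the four oldest into M2; popping then
drains M1 first and falls back to M2, so the last-in, first-out
order is preserved across the two memories.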


5.4 Effective Address Generation (EA) (Lexical Level
Offset Addressing)


Within the instruction architecture of a stack oriented
machine there exists a class of instructions which refer to
information within the stack. Whenever one of these instructions
is encountered an effective address (EA) must be calculated. The
sequence to be presented depicts a specific design [1]. In
general, the details of EA calculation might be different.
However, some form of addressing with the stack must be provided.

The format of the class of instructions explicitly
referencing the stack is:

     contents     op code | A2 | A1
     # of bits    12 5

The address couple A2 || A1 forms a 13 bit field, $A_{12}, A_{11},
A_{10}, \ldots, A_0$, which is interpreted as follows:

a) The lexical level indicator, ll, is the key to the
interpretation of A2 || A1. The first step is to find the
positive integer m, where:

     $2^{m-1} \le ll < 2^{m}$

b) Form Field 1 where

     Field 1 = $A_{12}, \ldots, A_{13-m}$

c) Fetch from M1 the base register specified by Field 1.
Denote this base register by BRm.

d) BRm is in Stack Number, Offset representation.

e) Next Field 2 is formed

     Field 2 = $A_{12-m}, A_{11-m}, \ldots, A_0$

f) Finally the effective address (EA) is formed where

     EA = (BRm) + Field 2

This addition only occurs to the offset portion of (BRm).

Figure 5.3: POP
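
Steps a) through f) can be sketched as follows. The sketch makes
several assumptions: the function names are hypothetical, the
lower bound on ll is taken as inclusive (so that ll = 1 gives
m = 1), the base registers are modelled as a mapping from the
Field 1 value to a (stack number, offset) pair, and the address
couple is handled as a 13 bit integer with $A_{12}$ as its most
significant bit.

```python
ADDRESS_COUPLE_BITS = 13


def split_address_couple(couple, ll):
    """Split a 13 bit address couple into (Field 1, Field 2) for level ll."""
    if ll < 1:
        raise ValueError("lexical level indicator must be at least 1")
    if not 0 <= couple < (1 << ADDRESS_COUPLE_BITS):
        raise ValueError("address couple must fit in 13 bits")
    m = ll.bit_length()                    # satisfies 2**(m-1) <= ll < 2**m
    field2_bits = ADDRESS_COUPLE_BITS - m  # bits A(12-m) .. A0
    field1 = couple >> field2_bits         # bits A12 .. A(13-m)
    field2 = couple & ((1 << field2_bits) - 1)
    return field1, field2


def effective_address(couple, ll, base_registers):
    """Return EA as (stack number, offset); only the offset is added to."""
    field1, field2 = split_address_couple(couple, ll)
    stack_number, offset = base_registers[field1]
    return stack_number, offset + field2


if __name__ == "__main__":
    # Lexical level 3 needs m = 2 bits to select the base register.
    couple = (2 << 11) | 5                 # Field 1 = 2, Field 2 = 5
    registers = {2: (1, 100)}              # hypothetical base register contents
    print(effective_address(couple, 3, registers))
```

Selecting m from the current lexical level is what lets deeply
nested code give up displacement bits only as it needs more base
registers.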


5.5 Stack Fetch 


When information is required from any location except
the top of stack, a stack fetch sequence must be executed (see
Figure 5.4).

The main test to be performed is to determine whether
the information to be fetched is in the M1 or M2 part of the
stack. This is accomplished by the calculation of the
displacement DISP. Information is then read from either M1 or M2
and placed in M1BR.


Reference for Chapter 5

1) Intermetrics, Inc., "Final Report - Engineering Study for
the Functional Design of a Multiprocessor System",
Prepared Under Contract NAS9-11745, September 1972.

Figure 5.4: Stack Fetch



Chapter 6 

I/O CONSIDERATIONS 


6.1 Space Station System Requirements

The I/O interface of the computer which serves the central
computational and control element of the manned space station
is likely to be characterized by the following observations:


a) There will be a large number and variety of interfaces
with diverse avionics equipments. The recent Phase B
Space Station analysis has advocated the use of a
time-shared, high speed (10 MHz) avionics data bus to
simplify the problem of meeting this requirement. The I/O
implication of such a data bus will be discussed in this
report.

b) The computational speed and storage capacity requirements
of the Space Station are such as to make the multiplexing
of operating memory an attractive economical proposition.
(The cost of storing one bit in a core or plated wire
memory is over one thousand times the cost of storing it
on a disk.) Until the more exotic, non-moving media,
secondary storage technologies (such as magnetic bubbles)
become fully operational, the more conventional magnetic
drum and disk will probably provide the mass storage
capability on the early space stations. The relatively
long access time of these devices has made it necessary
to treat the problem of getting information in and out of
them as an off-line task in parallel with the main
computational functions. This chapter will discuss the
use of a drum or disk as the tertiary level of a memory
hierarchy and as the primary storage for files.

c) Although the Space Station central multiprocessor will
possess the powers of a typical large ground based
computer facility, it is not anticipated that its work load
will encompass as wide a variety of jobs, languages, or
users. Perhaps of even more importance, the work load
will be much more predictable. This is certainly true
of the operational requirements, and even the eventual
experimental support function will probably be fairly
carefully tailored to the available facility. The
implication of this for the I/O function is that there is
less need for a highly generalized interface to a wide
variety of the conventional peripheral equipments, and
much less need for the sophisticated data management
facility usually found in the I/O hardware and software for
controlling these peripherals and providing for the
orderly management of a large number of files. It will be
assumed that the only need for standard peripheral I/O
channels in the planned SUMC MP will be to satisfy the
needs of a laboratory environment (e.g., card reader,
line printer, operators' console), and that the eventual
operational I/O will be performed almost entirely through
the avionics data bus.

d) The emphasis on the generation, processing and recording
of large amounts of data from experiments places
the high density, high speed tape store into a special
category of space station I/O device. Even if an improved
bulk storage technology is eventually employed in
this function, the need for transferring and retrieving
large blocks of data from archival storage at rates on
the order of several million bits per second will still
have to be met. This data originates at the experiment
sensors, and enters the system for processing and
reduction via the main data bus, which, as will be seen, can
typically supply 2.5 million information bits per second.
It is felt that a more specialized interface than
just another port on the bus is required for this I/O
function.

The major impacts of these observations on the I/O hard- 
ware and software will now be discussed. 


6.2 Data Bus I/O


In order to make more than sweeping generalizations, some 
assumption of data bus characteristics must be made. Studies to 
date [1] have shown that an initial Earth Orbital Space Station 
can be serviced by a data bus whose elements are shown in Figure 
6.1, and which has the following typical characteristics. 

     Multiplexing                    TDM
     Frequency                       10 MHz
     Number of devices (stations)    256
     Command structure               Command/response

These are the important control characteristics from the point 
of view of I/O communication. 

Command/response implies central computer control. Bus
I/O takes place only at the behest of the computer; no device
may volunteer information. It is our opinion, however, that
although a strict C/R control policy may be shown to be quite
adequate at this stage of Space Station development, it will be
advantageous to provide a bus interrupt capability. This is
not so much in order to provide the devices with control
authority, but rather to allow the bus control unit (BCU) to
off-load from the computer I/O routines chores such as error
monitoring, detection of unusual conditions, response to
unsolicited communication from Station subsystems, etc.

Local processing at the device level has been proposed to 
off-load from the bus any high speed repetitive functions (such as 
strapdown inertial system algorithm evaluation). It is expected 
that bus communication between computer and device will be com- 
posed of short blocks of data from one to several bytes in length, 
typically 1 to 128. Data transfers of larger blocks (e.g., CRT 
display frames, experimental data recording) are usually not time 
critical, and may be achieved by repeated bus I/O. If 8 bytes 
suffice for device address and address echo check, and assuming 
a typical 80:20 mix of short (4 byte) and long (128 byte) bus 
communications, the time to service 256 devices is derived below: 


               ------------ Bytes ------------   Bits
               Control                 Total/    Total/    x (# of
               Command   Echo   Data   Device    Device    devices)   Total

Short             4        4       4     12         96       206      2 x 10^4
Long              4        4     128    136       1088        50      5 x 10^4

All messages                                                          7 x 10^4


A complete service cycle of all devices on a maximally 
configured bus thus generates 70K bits. For a 10 MHz transmis- 
sion frequency this cycle can be repeated every 7 milliseconds. 
In practice, delays due to finite transmission speeds will in- 
crease the cycle time, but a 10 ms to 20 ms bus service cycle 
seems to be entirely achievable. A 20 ms cycle, with the pre- 
ponderance of long communications assumed, will generate about 
300K bytes/sec of actual data, i.e., a data rate comparable to 
that of the higher speed storage devices such as drums, disks, 
and tapes. However, a data bus differs significantly in the 
manner in which this data is addressed and controlled. 
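
The arithmetic behind the service cycle estimate can be
reproduced with a short sketch. The function names are
hypothetical, and the bus is assumed to carry one bit per cycle
of the 10 MHz transmission frequency; the byte counts and device
mix are those of the table above. The exact totals come out
slightly above the rounded figures quoted in the text (about 74K
bits and 7.4 ms rather than 70K bits and 7 ms).

```python
BITS_PER_BYTE = 8
BUS_BITS_PER_SECOND = 10_000_000  # 10 MHz, assumed one bit per cycle


def message_bits(command_bytes, echo_bytes, data_bytes):
    """Bits on the bus for one device access."""
    return (command_bytes + echo_bytes + data_bytes) * BITS_PER_BYTE


def service_cycle(short_devices=206, long_devices=50):
    """Return (total bits, minimum cycle seconds, data bytes) per full cycle."""
    short_bits = message_bits(4, 4, 4) * short_devices
    long_bits = message_bits(4, 4, 128) * long_devices
    total_bits = short_bits + long_bits
    cycle_seconds = total_bits / BUS_BITS_PER_SECOND
    data_bytes = 4 * short_devices + 128 * long_devices
    return total_bits, cycle_seconds, data_bytes


def data_rate(data_bytes, cycle_seconds):
    """Bytes of actual data delivered per second for a given cycle time."""
    return data_bytes / cycle_seconds


if __name__ == "__main__":
    bits, seconds, payload = service_cycle()
    print("bits per service cycle:", bits)
    print("minimum cycle time (ms):", seconds * 1000)
    print("data rate at a 20 ms cycle (bytes/s):", data_rate(payload, 0.020))
```

With a 20 ms cycle the payload works out to a few hundred
thousand bytes per second, the same order as the "about 300K
bytes/sec" stated in the text.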
The type of bus described is essentially a table-driven
device: in practice, communication between the computer and the
avionics devices will occur as follows:

a) A number of device interfaces will need to be accessed
for real time data at the highest service cycle frequency,
i.e., every 10 ms to 20 ms.

b) Others will require accessing periodically, but at lower
frequencies than the maximum.

c) Some will require occasional sampling at random intervals.

d) Some devices may be attached but may not be components
of current activity; their status and health must
nevertheless be continuously known.

e) The remaining interfaces may not even be attached.


The mix of devices in each category is a function of mission phase
and/or station operations. It is a delicate design problem to
ensure that all the highest frequency requests are completed
without exceeding the basic bus service cycle, and without losing
some of the less frequent requests. Since these constraints are
known only to the system implementer, a specific bus
configuration should not be wired-in to the hardware (or system
software) of the computer or I/O controller.

The device accesses can be organized into a set of I/O
tables. Each table contains the list of accesses to be
accomplished at a given frequency. Figure 6.2 illustrates an
example of such a table, made up of entries for bus I/O to be
accomplished for K = 1 (every service cycle), K = 2 (every other
cycle), K = 4 (every fourth cycle), and so on up to K = 64. K
need not be in powers of two, but it is felt that this makes
table mechanization much easier, and is not a serious burden to
the avionics system implementer.

Each entry in the table is a request for bus I/O. Such
a request may consist of one or more words with fields which
contain the following information:

     | IOC Command | BCU Command | Bus Command | Device Operand | Memory Address |

     Figure 6.3: Typical Bus I/O Request

a) I/O controller field specifying I/O channel type (i.e.,
bus), channel number, channel command.

b) Bus controller field specifying special instruction
to BCU (e.g., table update, check device status, etc.)

c) Bus command field specifying device address and bus
operation (e.g., read, write, set mode, get status,
etc.)

d) Device operand field specifying operation to be performed
by specific avionics subsystem (interpretation known
only to device)

e) Destination field -- address and length of memory area
in which result of bus I/O is to be placed, or from
which output is to be taken.
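
The table organization and request fields described above might
be modelled as follows. This is a sketch only: the class, field
and function names are hypothetical, the field values are plain
strings and integers chosen for illustration, and a table with
period K is assumed to be serviced on every cycle whose number is
a multiple of K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusRequest:
    """One bus I/O request with the five fields listed in the text."""
    ioc_command: str      # a) channel type, channel number, channel command
    bcu_command: str      # b) special instruction to the bus control unit
    bus_command: str      # c) device address and bus operation
    device_operand: int   # d) interpreted only by the addressed device
    memory_address: int   # e) start of the memory area for the transfer
    length: int           # e) length of that memory area


def requests_due(tables, cycle):
    """Return the requests to execute on the given service cycle.

    `tables` maps a period K to the list of requests serviced every
    K-th cycle.
    """
    due = []
    for period in sorted(tables):
        if cycle % period == 0:
            due.extend(tables[period])
    return due


if __name__ == "__main__":
    fast = BusRequest("bus/0/start", "none", "dev 17 read", 0, 0x1000, 4)
    slow = BusRequest("bus/0/start", "status", "dev 42 read", 0, 0x2000, 128)
    tables = {1: [fast], 4: [slow]}
    for cycle in range(5):
        print(cycle, [r.bus_command for r in requests_due(tables, cycle)])
```

Keeping the periods as data rather than code reflects the point
made earlier that the bus configuration should not be wired into
the hardware or system software.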

As each I/O request is executed the appropriate data
is transferred between memory and device. The question now
arises: how is the table of I/O requests to be interpreted and
where does it reside? Several alternatives present themselves:

a) It resides entirely in main (operating) memory and each
entry is treated as a separate I/O request to the software
executive I/O routines. If there is a large number of
high frequency entries this will create an I/O bound
condition, and much process swapping in a multiprogrammed
environment.

b) The table of I/O requests resides in the I/O controller
and is executed there independent of main processing.
Only the result of each request is transferred to memory.
This relieves the interface between the I/O controller
and the operating memory of traffic generated by control
statements.

c) The table of I/O requests and the resulting data reside
in buffer storage local to the I/O controller. Data
transfer is in block updates between minor cycles.

The progression from a) to c) implies an increasingly
elaborate I/O controller. It also incurs the problem of
buffering the bus I/O data. If a user program no longer has the
ability to place each individual request, then it has no
knowledge of when an update to (or from) the requested bus device
is made. This is especially critical for blocked data, where
it is essential to ensure homogeneously updated elements of the
block. A mechanism for preventing multiple access to data blocks
must be provided, such as a TEST and SET operator, or multiple
buffers with switchable pointers: the first incurs delays
(critical to an I/O process), and the second consumes memory
space.
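
The second alternative, multiple buffers with a switchable
pointer, can be sketched as below. This is an illustrative model
rather than the mechanism of any particular I/O controller: the
class and method names are hypothetical, a single writer (the I/O
side) is assumed, and a Python lock stands in for whatever
hardware interlock would guard the pointer switch.

```python
import threading


class DoubleBuffer:
    """Two equal blocks; readers always see one complete, consistent block."""

    def __init__(self, size):
        self._size = size
        self._buffers = [[0] * size, [0] * size]
        self._front = 0                  # index of the block readers may use
        self._lock = threading.Lock()    # guards the pointer and reader copies

    def write_block(self, values):
        """Fill the hidden block, then switch the pointer in one step."""
        if len(values) != self._size:
            raise ValueError("block has the wrong length")
        back = 1 - self._front
        self._buffers[back][:] = values  # readers never look at this block
        with self._lock:
            self._front = back           # the switch waits for active readers

    def read_block(self):
        """Return a copy of the most recently completed block."""
        with self._lock:
            return list(self._buffers[self._front])


if __name__ == "__main__":
    block = DoubleBuffer(3)
    block.write_block([1, 2, 3])
    print(block.read_block())
    block.write_block([4, 5, 6])
    print(block.read_block())
```

The cost noted in the text is visible here: the scheme holds two
copies of every block, in exchange for never exposing a partly
updated one.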

The localization of bus I/O in the IOC allows high
frequency bus-computer communication to be conducted without
the several milliseconds delay normally associated with I/O
devices such as drums or disks, and obviates the need for
process swapping to maintain throughput. The low frequency or
random bus communication can be handled in a conventional
fashion as a single I/O event. Such requests can be treated as
temporary insertions into the bus I/O request tables, which are
removed when serviced by the bus. Completion of the request
can be signalled by an I/O complete interrupt. Division of bus
I/O requests into repetitive and random categories depends on
the trade-off between IOC complexity, I/O buffer size, bus
service frequency, and throughput.


6.3 Mass Storage I/O

The most critical function of secondary storage is
as part of the multiplexed operating memory hierarchy. Whether
the technique employed organizes memory into fixed size blocks
(pages) or variable sized blocks (segments), it is essential to
be able to locate and transfer to and from secondary storage
fairly large amounts of stored information (from tens to
thousands of words), in a minimal time.

The traditional disk or drum memory systems possess 
characteristically long latency and/or access times (on the 
order of tens of milliseconds) , and data transfer to those 
devices is performed in parallel with other CPU activity by an 
independent processor. It is anticipated that early Space Sta- 
tions will still employ rotating magnetic storage devices and 
that I/O will continue to be concerned with their optimal usage. 
It is important to realize that a subsequent change to solid 
state mass storage (with little or no access delay) can radi- 
cally modify the concept of memory multiplexing, to the point 
where it may not be done via the I/O controller. In the present 
discussion, we will assume the conventional core to disk inter- 
face requirement. 

The major concerns with optimal usage of the disk are: 

a) Since access times are long (typically 10 to 100 milli-
   seconds), but transfer rates are high (typically 5 to
   10 MB/s), it is desirable, when a request for a missing
   memory block is honored, that as much useful associated
   information as possible be transferred along with the
   specified block, since the cost of so doing is relatively
   low. This involves maximizing the "locality" of the
   executing program which creates the I/O request, or
   otherwise anticipating its accessing behavior.

b) Since requests for data take so long to honor, it is
   probable that, by the time a requested block is located
   and transferred, the requesting process is no longer
   running. It becomes desirable to allow the memory
   management to determine, at its convenience, when to
   alert waiting processes of their completed I/O requests.
   This may be done by maintaining a table of completed I/O
   requests rather than signalling the system via an
   "I/O complete" interrupt, as is usually done.

c) This may be done by causing a table of completed I/O
   requests to be kept in the I/O controller, and [text
   illegible in source]; only when no further requests are
   pending does the I/O controller interrupt the system to
   notify it that all requests have been expedited. A
   "quiet" I/O complete scheme such as this is expected to
   greatly minimize the "thrashing" of memory transfers that
   occurs when operating memory becomes overcommitted.

d) The assignment of disk space can become as critical as
   that of operating memory. For a high degree of memory
   multiplexing, disk space can become badly fragmented
   with use, necessitating a compacting or rearranging of
   the assignment of files. In a real-time system it may
   require prohibitively long search cycles to update all
   references to files that are re-assigned. Disk addresses
   can be organized in a central directory which maps
   logical into physical address space. This can be
   accomplished in main memory, at considerable cost of
   space, or on the disk, at the cost of more complex
   hardware in the disk controller.
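
The "quiet" I/O complete scheme of points b) and c) can be
sketched in Python as follows. This is only a minimal model under
stated assumptions: the class QuietIOController, its method names
and the (process, block) request tuples are hypothetical, and the
interrupt is represented by a simple counter.

```python
from collections import deque


class QuietIOController:
    def __init__(self):
        self.pending = deque()   # requests not yet serviced
        self.completed = []      # table of completed requests
        self.interrupts = 0      # number of interrupts raised

    def submit(self, process_id, block_id):
        self.pending.append((process_id, block_id))

    def service_one(self):
        """Complete the oldest pending request; interrupt only when idle."""
        if not self.pending:
            return
        self.completed.append(self.pending.popleft())
        if not self.pending:
            self.interrupts += 1  # single notification: all requests done

    def collect_completed(self):
        """Called by memory management at its convenience."""
        done, self.completed = self.completed, []
        return done


ioc = QuietIOController()
for pid, blk in [(1, 10), (2, 20), (3, 30)]:
    ioc.submit(pid, blk)
for _ in range(3):
    ioc.service_one()
```

Three requests are serviced, but the system is interrupted only
once, after the last of them; the completion table is read
whenever memory management chooses to do so.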


Other traditional I/O problems (such as the trade-
off between I/O request frequency and I/O buffer space
in main memory, and the related question of logical
file blocks and how to assign them to a device that
is organized into physical records) still remain in
a Space Station environment. But, as stated in the
beginning, these questions are of less significance
in an environment whose work load and user requirements
are less variable and more known. A less generalized
approach to file directory management may be possible
than is found in general purpose ground-based facili-
ties such as the larger IBM 360 installations.


6.4 I/O Controller Design

This section will describe the functional elements of
a proposed I/O controller design. Detailed implementation ques-
tions are beyond the scope of the present contract. Figure 6.4
indicates the basic functional elements.


6.4.1 Central Control (CC) 

The central control unit provides for the decoding of
I/O operations, for the initiation and synchronization of com-
mands, and for data transfers between the units. The CC contains
an arithmetic unit and the logic required to perform conditional
decisions. The sequences issued by CC are stored in a micro
control memory and are initiated via commands from the various
interfaces.


6.4.2 Interprocessor Communication Interface (IPCI)


Some mechanism is clearly required for communicating 
between processors and the I/O controller. This is necessary 
for interprocessor interrupts, I/O commands, and recovery from
processor faults. 


The IPCI provides the interface to the interprocessor
communications bus. One may reasonably question whether a sep-
arate interprocessor communication interface is required. Can
not all the communications go through M2?


If all the interprocessor communications occur by 
writing into M2 and reading from it, then the answer to the 
above question is no! The overhead due to constantly polling 
M2 would waste processor time and create excessive M2 contention.


If processor communication uses the internal bus as
a communications medium, by-passing M2, then the answer is
probably yes. The use of the internal bus as the communications
medium is just an implementation decision. The fact remains that
distinct communication between processors and between processor 
and I/O must occur, outside of M2. The logical decisions per- 
formed by IPCI must exist whether a physically separate inter- 
processor communications bus (IPCB) is employed or not. 

A wide variety of signals are communicated over an 
IPCB. Some are between processors. Others involve I/O trans- 
actions. Some examples are given below: 


Figure 6.4: Elements of I/O Controller

a) If local memory M1 is employed, then a potential pro-
blem exists in updating common information (for ex-
ample, descriptors) contained within different M1's.
The control of the updating requires interprocessor
communications.

b) The loading (initialization) and dumping (for a pro-
cess swap) of M1 can be triggered within a processor
or commanded from another processor (in case of an
error condition).

c) When a processor fails or detects an M2 failure this
information must be signalled to another processor.

d) All the commands issued by I/O executive routines must
be sent to the I/O controller over some communicating
link.

e) All "done" or "error" interrupts generated by or pas-
sed on by the IOC must be steered to a processor over
a communications link.


6.4.3 Operating Memory Interface 

This interface element controls access to memory by 
the various channels. It is, in effect, the DMA channel for 
the I/O controller. The priority as to which I/O interface has 
access when contention exists is fixed. The following is sug- 
gested: 


Priority 1 (highest) Channel 1: The devices which
operate in the burst mode must be serviced at a rate consistent
with their data rate. M3 can possess a data rate of up to 10
MBPS, which is three to six times less than the M2 data rate.
However, channel 1 devices cannot sustain a large delay between
a request for an M2 transfer and the final servicing of the re-
quest since the addressed record is usually not fully buffered
and M2 and the auxiliary device must be synchronized during a
data transfer.

Priority 2 Channel 2: The devices which are driven
by tables in the local memory of channel 2 present to M2 a
data rate three to six times less than that of channel 1. Yet,
if too much delay is introduced in each M2 transfer, the minor
and major cycle times might be exceeded.

Priority 3 Central Control: When the CC receives a
command over the IPCB it often has to fetch an I/O control word
from M2. While this fetch can be delayed a reasonable amount
of time, queueing of too many IPC commands before execution
must be avoided.

Priority 4 Channel 3: The devices attached to channel
3 are all slow speed and involve only a few bytes per transaction.
A delay of ten to even one hundred M2 cycles will not appreciably
affect the performance of these devices.

Priority 5 (lowest): Since the interrupt priority
and timer elements of the I/O unit do not use M2 to a signifi-
cant extent, these elements are placed in the lowest priority
category.
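
The fixed priority ordering suggested above amounts to a simple
arbitration rule, sketched here in Python. The requester names
and the function grant are hypothetical labels chosen for this
illustration; the study does not specify an implementation.

```python
# Requesters in the fixed order suggested in the text, highest first.
PRIORITY = [
    "channel1",         # burst-mode devices (disk, drum, tape)
    "channel2",         # table-driven data bus traffic
    "central_control",  # I/O control word fetches
    "channel3",         # slow speed devices
    "interrupt_timer",  # interrupt priority and timer elements
]


def grant(requests):
    """Return the highest-priority requester, or None if none is waiting."""
    for name in PRIORITY:
        if name in requests:
            return name
    return None
```

For example, grant({"channel3", "channel2"}) selects "channel2",
since channel 2 outranks channel 3 whenever both contend for M2.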


6.4.4 Channels 


These control the interface to the device categories
defined previously, namely:

a) the high speed disk (or drum) and tape 

b) the avionics data bus 

c) slow speed unit record equipment. 

Each channel will contain buffer capacity appropriate to the 
device, and a set of instructions tailored to the control re- 
quirements of the device. 

6.4.5 Interrupt Handler 

Although not a unique location for the interrupt con- 
trol mechanism, the I/O controller often contains this function. 
There is some advantage in handling external interrupts and pro- 
cessor traps with the same mechanism. 


6.4.6 Timer 

The real time aspects of the MP system require access 
to a precise time standard. Also the capability of generating 
an interrupt at a predetermined time, probably by means of a
countdown mechanism, is required. Each counter must be addressable
from a processor for initialization or readout. These counters 
are placed inside the I/OC for convenience, thus saving the 
cost of providing a unique piece of equipment. 
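
A countdown mechanism of the kind described can be modelled in a
few lines of Python. The class CountdownTimer and its methods are
hypothetical names used only for this sketch; one call to tick
stands for one period of the time standard.

```python
class CountdownTimer:
    def __init__(self):
        self.count = 0
        self.armed = False

    def load(self, ticks):
        """Initialization from a processor."""
        self.count = ticks
        self.armed = ticks > 0

    def read(self):
        """Readout from a processor."""
        return self.count

    def tick(self):
        """Advance one clock period; return True when the interrupt fires."""
        if not self.armed:
            return False
        self.count -= 1
        if self.count == 0:
            self.armed = False
            return True
        return False


timer = CountdownTimer()
timer.load(3)
fired = [timer.tick() for _ in range(5)]
```

The interrupt is raised exactly once, on the third tick after a
count of three is loaded, and the counter then remains idle until
it is loaded again.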


6.5 I/O Configuration Organized for Recovery

The I/O configuration presented in Section 6.4 indi-
cates that a single I/OC is capable of servicing the multipro-
cessor. If this design approach is taken, how can this single
I/O meet the requirements dealing with recovery from a failure?

If two or more I/O units are required for system oper-
ation then the recovery aspects of the I/O can be made very sim-
ilar to those of a processing unit. Each of the I/O units
would be configured like a processing unit with an M2 interface,
a special interface to the Processors via dual redundant commun-
ication links, an M3 interface, and a data bus to the outside
world. Single Instruction Restart could be employed as the major
recovery mechanism.

Since only a single I/O unit is proposed to meet the 
performance requirements, a triple-redundant I/O unit with voting 
logic is a candidate design approach. Many transients are com- 
pletely masked in this configuration. If a permanent failure 
occurs then the voting elements can be reconfigured to compara- 
tors and the bad I/O unit taken off line for repair. 
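
The behaviour of a voting element, and of the comparator it
becomes after one unit is removed, can be sketched in Python as
follows. The function names vote and compare are hypothetical;
the sketch models the logic only, not the proposed hardware.

```python
def vote(a, b, c):
    """Majority vote over the outputs of three redundant units.

    Returns (value, faulty_index); faulty_index is None when all agree.
    Raises RuntimeError if no two units agree.
    """
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise RuntimeError("no majority")


def compare(x, y):
    """Comparator mode for the two remaining units.

    A disagreement can be detected but no longer masked.
    """
    if x != y:
        raise RuntimeError("comparator mismatch")
    return x
```

With three units a single bad output is outvoted and identified
(vote(5, 9, 5) yields the value 5 and marks unit 1 as faulty).
With two units a disagreement can only be reported.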

Figure 6.5 shows a possible redundant I/OC employing 
the components described in Figure 6.4. The major features of 
this configuration are described below. 

a) The triple redundant I/O hard core contains the cen-
tral control, timers and the interrupt control. The
system can keep running after a failure in this critical
area, without the error being propagated.

b) In order to interface the TMR section with other dual 
redundant interfaces, voters and switches are provided. 
The S elements, which are controlled by their asso- 
ciated I/O elements, are used to select which of the 
dual redundant interfaces to accept data from. The V 
elements vote upon the triple redundant I/O outputs 
and produce dual redundant outputs. The voters will 
automatically reconfigure to comparators and switch 
out a faulty I/O where required. 

c) The IPCB, M2, M3, and data bus are all postulated to 
be dual redundant. For this reason their interfaces 
are shown to be dual and they interface to the I/O 
via the S's and V's. The multiplexer channel which 
contains peripherals necessary to operate a laboratory 
model is only shown as a simplex subsystem, with a 
corresponding single interface. 

d) It is assumed that all the peripheral devices attached 
to the data buses and the M3 controller possess char- 
acteristics which will aid in the recovery process. 
These characteristics include: 

1) hardware to aid in fault isolation between dual
redundant threads

2) sufficient buffering, so that aborted commands
cannot hang up a subsystem

3) the ability to be reset and to indicate upon
request the status of the I/O device

Figure 6.5: Redundant I/O Configuration

e) Certain problems caused by locking of processes to I/O
devices must be resolved by the operating system. This
requires the capability of selectively deleting the I/O
command created by a process which is cancelled (either
purposely or as the result of a failure) from the appro-
priate device queue. Also, the capability of releasing
any M2 space allocated as the I/O buffer area must be
provided.

One of the main motivations for a triple redun- 
dant I/O central core is to reduce this problem as far 
as I/O failures are concerned. A failure within the 
central TMR I/O cannot propagate past the voters. How- 
ever, a voter or channel failure can cause a temporary 
suspension of I/O or a re-issuing of an I/O command 
and the associated problem of releasing any I/O locks. 


References for Chapter 6 

1) North American Rockwell, Space Division, "Modular
Space Station Phase B Extension - Information Mana-
gement Advanced Development Report", Contract NAS9-
9953, MSC-02471, July 1972.


Chapter 7

FAULT TOLERANCE PHILOSOPHY FOR THE SUMC MULTIPROCESSOR 


The purpose of this chapter is to present the study 
results in terms of error detection, fault isolation and re- 
covery philosophy as applied to a multiprocessor system. 


7.1 Requirements

The requirements postulated for the system, as a 
result of the study, are delineated below. 

a) The only interaction that the applications programmer 
should possess with the fault tolerant aspects of the 
system is to specify whether and under what conditions 
a program or sequence of events is to be critical. A 
critical program is defined to be one which must be re- 
coverable in the event of a fault. A non-critical pro- 
gram is one which need not recover. 

When a program is classified as non-critical, certain
design considerations must be kept in mind. The ab-
rupt termination of a non-critical program in the mid-
dle of any instruction should not create a situation
which will prevent the execution of other critical
tasks. Any Compool data which is used by a non-critical
program can not be left locked. The failure of a non-
critical program can not lock out a piece of peripheral
equipment from use by a critical program.

b) It seems reasonable that for certain applications a 
recovery time of 10 to 100 ms could be required, es- 
pecially for certain real time control applications 
with iteration rates of 10 to 50 times per second. 

Other critical functions might take longer. The accep-
tance of recovery times of 1 minute or more essentially
means that the program which is to be recovered does
not fall in the real time category.


7.2 Error Detection

The most fundamental conclusion that has been reached
in the error detection area is that detection of hardware failures
must be completely a hardware function. (We are confining our
discussion to faults within the internal structure of the mul- 
tiprocessor. Peripheral I/O devices can, depending upon their 
characteristics, employ central processor software to provide 
diagnostic capability.) The above conclusion is based upon 
the following reasoning: 

a) An important aspect of any system which is to recover 
from a fault is to detect an error within a period of 
time which guarantees that the error hasn't propagated 
to a point where recovery becomes impossible. Assum- 
ing a given error is detected by a software self test 
routine, it is generally impossible to determine what
information in memory has been incorrectly modified. 
Without the ability to isolate the damage, repair can- 
not be effected and recovery becomes unattainable. 

Hardware error detection mechanisms such as parity,
comparators and specialized logic provide continuous
monitoring of the system. Software test routines
can only be executed periodically.

Error detection logic, properly designed, will 
more nearly approach the goal of instantaneous error 
detection which prevents the propagation of failures. 

b) If software self-test were to be employed, one must con-
sider the question of how long it will take to execute.
Hardware error detection need impose little if any
overhead upon the system performance. Software can
spend a considerable amount of time for two reasons:

1) To be comprehensive an extremely large number of 
tests must be run. 

2) They must be executed at a high frequency. 

The unfortunate thing about software self-test in the
past has been that, in most cases, hardware was not
designed with self-test in mind. It was very diffi- 
cult for the software to control precisely the hard- 
ware state. Micro level diagnostics tend to allevi- 
ate this problem to a degree. Because of an inabi- 
lity to test easily all features of a system, self- 
test software demonstrates the phenomenon that a 
large percentage of equipment functions can be tested 
with a relatively small amount of code, while the 
final few percent of the equipment tests require a 
very large amount of code.

c) The periodic nature of software error detection makes
transient error detection difficult. Two categories
of transients may be isolated:

1) Type 1 transients cause a temporary incorrect 
electrical signal but do not change the state 
of any storage element. 

2) Type 2 transients occur at such a point in the 
sequencing of a processor that incorrect storage 
occurs. The hardware satisfies all tests that 
can be invented, yet bad information may exist 
which will eventually cause incorrect system per- 
formance. 

If a type 1 transient is not detected it hardly mat- 
ters to the functioning of the system. However, an 
undetected type 2 transient could possibly be catas- 
trophic. An error detection philosophy which provides 
a continuous monitoring at critical points is neces- 
sary in order to prevent type 2 transients from going 
undetected and propagating. 

Micro diagnostics, although more comprehensive 
and easier to write than software, must still be exe- 
cuted on a periodic basis. Their ability to detect 
transient failures must be seriously questioned. 

d) The final point against software diagnostics as the
sole error detection mechanism is that failures can
occur which disable the execution of the software.
In that case, the signalling of a fault condition can not
occur.

7.2.1 Implementing Hardware Error Detection 

Error detection is intimately involved with the specific
failure modes of devices and equipment. If the various failure
modes and the propagation dynamics of the failures are studied, 
then, in specific instances, the addition of a moderate amount 
of logic can detect the anticipated failures. On the other 
hand, one would like to employ techniques which are not very 
dependent upon the specifics of the equipment in order to pro- 
vide a degree of flexibility and generality. The appropriate 
decision between specialized and generalized error detecting 
logic is a matter of engineering judgement.

7.2.1.1 Processing Unit: The processing units of the multipro-
cessor are the major sources of error propagation. If incorrect
write operations are executed, due to a failed component, then
the normal sequencing of the processing units, using this in-
correct data, can cause propagation of the error to other por-
tions of memory. Propagation of errors can extend beyond the
multiprocessor system, if incorrect I/O commands are issued
and executed. Because of the potential devastation caused by
a processing unit failure, a maximum design effort must be un-
dertaken to detect P failures before they propagate to other
parts of the system. Within the limits of practicality, an
effort must be made to detect almost all failures within P, be-
fore incorrect write operations or invalid I/O operations are
executed.


Based upon these objectives, the study conclusions sug-
gest that processing unit error detection be accomplished by em-
ploying two synchronized but independently operating processors
with a fail-safe comparator placed across the memory interface.
Some of the reasons for this conclusion are presented below:

a) Periodic software self-test cannot catch all failures 
before they propagate to multiple errors. 

b) Error detecting codes internal to the processing unit 
cannot detect a large category of failures. For ex- 
ample, the failure of a control signal can cause al- 
most every bit in a word to be incorrect. The use of 
arithmetic codes, such as a Modulo 3 check, produces 
inconsistent results under operations such as AND, OR,
NOT.

c) It will require at least twice the logic, and incur 
more than twice the cost, to detect all possible single 
component failures in P. Therefore, the cost of a 
dual P unit is reasonable. 

d) The redundant processors can be packaged separately 
with independent power distribution. This will more 
closely meet the failure independence assumption. 

e) Redundancy with a comparator at only one interface will 
reduce the number of interconnections between the re- 
dundant processors. 

f) Errors are detected before bad outputs may propagate 
from the P. The comparator placed at the output of P 
might allow an error to propagate within P, but no 
bad information leaves P.

If one were to design a processor considering error de-
tection as one of the main specifications, then each
module could be designed to detect its own errors.
Appropriate design efforts must be spent in maintaining
statistical independence between failures and preven- 
ting errors in the error detection logic itself from 
going undetected. This innovation to the logic design 
effort would prove to be an interesting research topic. 
As far as employing the present SUMC design as the 
processing element of the multiprocessor, the use of 
two SUMC elements with a comparator seems to be the 
most reasonable approach. 
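
The remark in point b) about arithmetic codes can be checked with
a short Python experiment, given here as an illustration only. A
Modulo 3 residue is carried consistently through addition, but
the residue of an AND or OR result is not determined by the
residues of the operands. The helper name residue and the operand
values are chosen for this sketch.

```python
def residue(x):
    """Modulo 3 check symbol of a non-negative integer."""
    return x % 3


# Addition: the residue of the sum follows from the operand residues alone.
for a in range(64):
    for b in range(64):
        assert residue(a + b) == (residue(a) + residue(b)) % 3

# Logical operations: two operand pairs with identical residues (1, 2)
# give results with different residues, so a checker that sees only the
# residues cannot predict the check symbol of the result.
pair1 = (1, 2)
pair2 = (4, 5)
and_residues = (residue(pair1[0] & pair1[1]), residue(pair2[0] & pair2[1]))
or_residues = (residue(pair1[0] | pair1[1]), residue(pair2[0] | pair2[1]))
```

Both pairs have operand residues 1 and 2, yet the AND results
have residues 0 and 1, and the OR results have residues 0 and 2.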


7.2.1.2 Memory: The irregular structure of the processor leads
one to consider the use of dual processors as a cost effective
error detection mechanism. Memory structures tend to be very
periodic in nature, possess little if any combinatorial logic
outside of the addressing area, and therefore, are more amen-
able to the use of error detection codes. Simple word parity
is a degenerate case of an error detection code.

Memory can be a significant contributor to the hard-
ware cost of a multiprocessor system. For this reason, tech-
niques other than brute force duplication of memory modules
should be considered for error detection purposes. Depending
upon the details of the construction of memories, different
techniques can be employed. The following suggestions are made
and seem to serve the purpose for most state of the art mem-
ory architectures.

a) Word parity can detect single memory cell failures, 
sense amplifier failures, and other failures which 
manifest themselves as single bit errors. 

b) The incorporation of parity upon the address of the 
word proves satisfactory in detecting the failure of 
a single bit in the memory address register. 

c) Employment of special current threshold circuitry
can detect the simultaneous selection of more than
one memory word at a time.

d) The use of a time-out indication can detect the fail- 
ure of a memory module to sequence. 

e) The use of a write-and-verify mode of operation, where
every word written into memory is immediately read
again, can verify correct storage. This is particu-
larly applicable to NDRO type memory structure. For
a DRO memory system one must face the problem that the
read operation which is used for verification must be
followed by a write-for-restoration of the data. A
failure can occur during the second write operation
which would go undetected until the stored word is
used again. However, the write-and-verify operation
is still useful in detecting failure modes associated
with transient addressing, control or bit storage
failures.

f) Integrated circuit memories possess enough redundant 
addressing logic so that a partitioning of the memory 
into independent bit planes allows word parity to de- 
tect a large number of addressing errors. Present 
state of the art integrated circuit memories contain 
address decoding on each memory chip. Chips can be 
configured to contain one, two or four bits of 1024 
words on each chip. Since each chip contains its own 
address decoding, a failure of a chip can only mani- 
fest itself as an error on the output of the chip it- 
self. That is, it is localized to a few bits of the 
word. If each chip contained only one bit of each 
word, then a single word parity bit would detect all 
address decoding failures. 

g) The use of separate read and write logic in the con- 
trol area of the memory module will prevent a read 
command from turning into a write command, due to a 
single component failure. 
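
Suggestions a) and b) can be combined by storing with each word a
single parity bit computed over both the data and the address.
The Python sketch below is a model under stated assumptions: the
class ParityMemory, the mar_fault_bit parameter (which simulates
a single-bit failure in the memory address register) and the
corrupt_bit test hook are hypothetical and do not describe any
particular memory design.

```python
def parity(x):
    """Return 1 if x contains an odd number of 1 bits, else 0."""
    return bin(x).count("1") & 1


class ParityMemory:
    def __init__(self):
        self._cells = {}

    def write(self, address, word):
        # The stored check bit covers the data word and its address.
        self._cells[address] = (word, parity(word) ^ parity(address))

    def read(self, address, mar_fault_bit=None):
        # A faulty address register selects a neighbouring cell instead.
        actual = address if mar_fault_bit is None else address ^ (1 << mar_fault_bit)
        word, stored = self._cells[actual]
        if stored != parity(word) ^ parity(address):
            raise RuntimeError("parity error")
        return word

    def corrupt_bit(self, address, bit):
        """Test hook: flip one stored data bit without updating the check bit."""
        word, stored = self._cells[address]
        self._cells[address] = (word ^ (1 << bit), stored)


mem = ParityMemory()
mem.write(4, 0b1011)
mem.write(5, 0b0110)
good_word = mem.read(4)
```

A read with mar_fault_bit set, or a read after corrupt_bit has
been applied, raises the parity error, because either fault
changes exactly one bit covered by the check.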

7.3 Recovery

When a module of the multiprocessor fails, the presence 
of a spare (physically identical module) which can execute the 
same function does not necessarily mean that recovery can be 
accomplished. A failure not only eliminates certain physical 
resources (hardware) from potential allocation to executing pro- 
cesses, it also destroys information (program, data and status) , 
which is required for execution. The major problem associated 
with recovery is not the necessity of providing spare hardware 
with an appropriate reconfiguration switching mechanism. It 
is, instead, the problem of re-establishing all the information 
required by the process to recover. In order to achieve 
recovery, the system must be returned to some past state which 
is known to be correct. 

What exactly determines the state of a system? If 
real time is ignored, for the moment, then the system's state 
can be defined to be represented by the contents of all the 
storage elements, including M1, M2 and the Processor's control
flip-flops. The more dynamic changes to a system's state are contained
within M1 and P. M2 possesses less dynamic changes with time.
M3 is even more static. As one proceeds from the more dynamic
to the more static elements of a system's state, time becomes less
important to the recovery process. Therefore, software, which
is more time consuming than hardware, can be employed.

The discussion on recovery will address three major
areas:

a) The processing unit, P and M1

b) Operating memory, M2

c) Input output controller (I/OC) and its channels



Suggested approaches to recovery from both transients 
and permanent failures in these three hardware areas are pre- 
sented. 


7.3.1 Processing Unit (P-M1)


7.3.1.1 Restartable Instructions: One of the main suggestions
generated by this study, relative to a recovery from a proces-
sing unit failure, is to design all instructions to be restart-
able. This means that the point of recovery is the instruction
during which the failure was detected. It is assumed that all
failures are detected essentially instantaneously so that pro-
pagation of the failure does not cause incorrect information to
be written into M2 or bad I/O commands to be executed.

Although a restartable instruction is not a difficult 
technical feat, it does require a design effort. The following 
ground rules must be applied during the design implementation 
of each instruction: 

a) Each instruction must be partitioned into two phases. 
During phase 1 the instruction is fetched, data is 
read, computations are made and all memory write op- 
erations are placed into a temporary buffer area for 
execution during phase 2. 

b) During phase 2 the buffered information is copied
into its final destination in M1 and M2. The contents
of the buffer area are not destroyed until all the copy
cycles are completed and verified. Each phase is de-
signed to be separately restartable. Figure 7.1 sch-
ematically represents the execution of a generic re-
startable instruction.

c) If a failure indication occurs during phase 1, then
the old copy of the program counter indicates which
instruction was being executed. All of the infor-
mation needed to execute the instruction has not
changed, so phase 1 can be re-initiated. If a fail-
ure occurs during phase 2, then, even though some in-
formation might have been copied, the information tem-
porarily buffered in M1 is still valid, and a complete
re-initiation of phase 2 is indicated.

d) Interrupt testing can either occur at the end of phase
2 or at the beginning of phase 1. It is assumed that
all interrupt conditions are caught in latches, so that
the interrupt test is just a matter of reading these
latches and determining whether to fetch the next in-
struction in the instruction stream or to enter the in-
terrupt control micro-routine. The interrupt control
micro-routine must be designed to be restartable and
it must incorporate the concepts of a double phase
operation with a buffer area, i.e., the interrupt con-
trol micro-routine can be considered to be a restart-
able instruction.
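
The two-phase rule can be illustrated with a small Python model.
It is a sketch under stated assumptions, not the SUMC design: the
names Processor, Fault, phase1, phase2 and execute are
hypothetical, memory is a dictionary, and faults are injected by
the caller through the labels "p1" and "p2".

```python
class Fault(Exception):
    """Stands in for a hardware error-detection signal."""


class Processor:
    def __init__(self, memory):
        self.memory = memory  # final destinations (M1 and M2)
        self.buffer = {}      # temporary buffer for pending writes

    def phase1(self, compute, fail=False):
        # Fetch, read and compute; all writes go to the buffer only.
        self.buffer = {}
        if fail:
            raise Fault("detected during phase 1")
        self.buffer = compute(dict(self.memory))

    def phase2(self, fail_after=None):
        # Copy buffered writes to memory; the buffer survives until done.
        for n, (addr, value) in enumerate(self.buffer.items()):
            if fail_after is not None and n == fail_after:
                raise Fault("detected during phase 2")
            self.memory[addr] = value
        self.buffer = {}

    def execute(self, compute, faults=()):
        """Run one instruction, restarting whichever phase was interrupted."""
        faults = list(faults)
        while True:
            try:
                self.phase1(compute, fail="p1" in faults)
                break
            except Fault:
                faults.remove("p1")  # the injected fault was transient
        while True:
            try:
                self.phase2(fail_after=1 if "p2" in faults else None)
                break
            except Fault:
                faults.remove("p2")


mem = {"x": 1, "y": 2}
cpu = Processor(mem)
# An "exchange" instruction: reads x and y, writes them back swapped.
cpu.execute(lambda m: {"x": m["y"], "y": m["x"]}, faults=["p1", "p2"])
```

The exchange completes correctly even though phase 2 is
interrupted after x has already been overwritten, because the
restart copies from the buffer and never re-reads memory.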


What does a restartable instruction design allow the 
system to do? 

a) For transients which interrupt the normal execution
sequence, but do not destroy data, the retry of an
instruction will provide a simple method of recovery.

b) For transients, where information is modified, the
information must be restored before the instruction
is retried. The restoration of the lost P or M1 in-
formation can be accomplished by either error correc-
tion codes or by duplexed storage.

It is proposed that each instruction be designed
so that after an instruction is executed, the state of
the processing unit is always contained within M1.
Each processing unit would contain two M1's so that in
the event of failure of one, the information contained
in the second could be used. The size of M1 should
not be more than 100 words and so its duplication pre-
sents little hardware impact.

c) Recovery at the instruction level allows the entire
operation to be independent of the application pro-
grammer. Hardware and operating system primitives
can determine when and how to restart. All considera-
tions are based upon detailed information below the
instruction level. The application programmer could
not care less about these details.

Because single instruction restart (SIR) allows
a very quick recovery mechanism, one is not even con-
cerned about the impact of the delay between error
detection and recovery. This should be well within
the iteration period of the highest frequency periodic
application function.

d) Error detection within the instruction cycle as well
as SIR tends to eliminate questions of error propaga-
tion and the interactions between a failure and the
informational content of the rest of the system's
storage.


7.3.1.2 Critique of Alternatives: Why the emphasis upon a re-
startable instruction? What are the alternatives?

a) In a batch processing system where multiprogramming 
is not used, the failure of a processing unit catches 
only one program in a running state. All the submit- 
ted programs are completely independent and recovery 
is simply a matter of reloading the program and data. 
Many functions on the space station can be handled
by this "fresh start" approach. It is simple and
imposes minimum overhead. 

However, the real time aspect of some of the 
space station processing requirements makes the "fresh 
start" approach unfeasible. 

b) A "checkpoint restart" approach to recovery has been 
applied to systems where problems requiring hours of 
computer time are being run. At fixed intervals the 
complete contents of core as well as the processor 
registers are dumped onto a back up area on disk or 
tape. A snapshot is taken of the system's state. 

A superficial look indicates that with a 1 µsec
cycle time and a 100K word memory, a memory dump can
be accomplished in 100 milliseconds. This is not an
unreasonable time. However, let us investigate the
implications of "checkpoint restart" a little more
deeply.

1) If a snapshot requires 100 ms then one must
consider its effect upon system throughput. If one
desires to limit the overhead imposed by this
function to less than 5%, then a snapshot can not
be taken more than once every 2 seconds. If real
time requirements allow a recovery time of 2
seconds, then "checkpoint restart" might be a viable
candidate.

2) If the contents of operating memory and the
processing units are rolled back 2 seconds in time, can
one guarantee that the state of the mass memory
is always consistent? Must the contents of mass
memory also be dumped when operating memory is
dumped? In general, the answer is yes: in a
virtual memory system the memory hierarchy must
not contain inconsistent information. The need to
dump M3 periodically onto some archival storage device
such as tape (M4) seems to eliminate checkpoint
restart as a valid candidate for recovery in a real
time environment.

3) Even though M2 can be dumped in 100 ms, a disk,
drum or tape probably couldn't absorb the data at
a rate higher than 10 MBPS. This will increase
the snapshot time for 32 bit words to 320
milliseconds and the snapshot period to once every 6.4
seconds.
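
The figures quoted in items 1) to 3) follow from simple arithmetic, reproduced here as a short check. The function and constant names are illustrative only, and "100K" is read as 100,000 words, which is the reading consistent with the 100 ms dump time given in the text.

```python
def snapshot_time_s(words, seconds_per_word):
    """Time to dump memory at one word per memory cycle."""
    return words * seconds_per_word


def min_snapshot_period_s(snapshot_s, max_overhead):
    """Shortest period that keeps snapshot overhead within the limit."""
    return snapshot_s / max_overhead


WORDS = 100_000        # 100K word operating memory
CYCLE_S = 1e-6         # 1 microsecond cycle time
WORD_BITS = 32
DEVICE_BPS = 10e6      # 10 MBPS limit of disk, drum or tape
OVERHEAD = 0.05        # 5% throughput budget

memory_limited = snapshot_time_s(WORDS, CYCLE_S)
device_limited = WORDS * WORD_BITS / DEVICE_BPS

# Roughly 0.1 s and 2 s when limited by memory speed alone.
print(memory_limited, min_snapshot_period_s(memory_limited, OVERHEAD))
# Roughly 0.32 s and 6.4 s when limited by the backup device.
print(device_limited, min_snapshot_period_s(device_limited, OVERHEAD))
```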


7.3.2 Recovery From an Operating Memory (M2) Failure

Hardware failures and electrical transients in memory 
systems cause information to be destroyed. Recovery from a mem- 
ory failure would be very easy if the error patterns caused by 
failures and transients could be known with certainty. Many 
error patterns could then be corrected by employing error cor- 
recting codes. Unfortunately, it is impossible to analyze all 
possible failure modes under all possible environments to de- 
termine all possible error patterns. Failures exist which can 
not be corrected by error correcting codes. Error correcting 
codes are not useful when the timing mechanism fails in such a 
way as to prevent memory access. A failure in the addressing 
mechanism can not be corrected by the encoding of data. 

Error correcting codes can be successful when the
predominant error modes are single bit failure or small burst
failures. In general, however, duplication of the information
contained within the memory cells is required for successful
recovery from an M2 failure.


7.3.2.1 Problem Areas: When attempting to design a system which
is recoverable from M2 failures, a number of distinct problem
areas must be resolved:

a) Memory Management

Normal (non-failure-tolerant) memory management
deals with the allocation, deletion, and control of
memory space for program and data entities. When
failure recovery is made a requirement, additional
questions arise: How should redundant storage of
critical information be dealt with? How shall the
hardware and software interact to:

1) enable the continuous storage of redundant
information?

2) allow the accessing of valid information in the
presence of a fault?

b) Hardware Fault Isolation

When a memory error is discovered, how can it be
isolated to a repairable piece of equipment?

c) Information Fault Isolation

If the failure is isolated to a specific memory
module, one must be able to determine what
information was destroyed so that recovery action can be
controlled.

d) Storage of Redundant Information

Since the redundant storage of information
becomes a necessity for critical programs and data, a
question arises as to how and where the redundant
information should be stored: in M2 or M3 or a
combination of both?


7.3.2.2 Factors Behind the M2 Recovery Approach: A number of
considerations pointed to the suggested M2 configuration. The
following items consist of assumptions, observations and the
philosophy which leads to the approach presented in the next
section.

a) Consistent with the processing unit's failure recovery
philosophy, the applications programmer should not be
concerned with the details of the recovery procedure.
This is handled by the hardware and operating system.
There is, however, one aspect that must involve the
application programmer. He is the only one who can
initiate the specification as to which program and/or
data segments are critical. By definition, critical
segments are all those segments used by programs which
must recover and continue execution after a failure.

Non-critical programs need not recover. They must,
however, be terminated in such a way so as not to
interfere with critical programs. This is called Fail Safe.
Some observations and requirements necessary to enable
a program to Fail Safe are presented in Section 7.4.

Once the applications programmer indicates the
programs which are critical, the compiler can statically
assign critical or non-critical status to segments it
creates. Similarly, the operating system must also
assign criticality status to segments it dynamically
creates (for example, a stack).

b) The recommended approach to memory management is to 
employ a segmented virtual memory system. 

The virtual memory approach allows an
exploitation of the difference between read-only (program and
fixed data segments), and read-write (variable data
segments) information. If an M2 module which contains
program segments fails, it is desirable to exploit the
virtual memory mechanism, already implemented within
the system, to aid in the recovery process.

Most program segments can be considered to reside
in M3. They are brought into M2, on demand, for
execution. If the program segments contained within the
failed M2 module were, as the result of the failure,
made "not present", then the M3 to M2 transfer
mechanisms will allocate space and transfer anew the
required segments automatically. The "not present"
segment indication is contained within the program
segment descriptor. Descriptors are considered to be
data and are in turn stored redundantly in M2.

c) For a large computational system on-board a space
station, it is reasonable to assume that repair or
replacement of a failed M2 module will be performed
relatively quickly. The hardware error detection
mechanisms should be able to isolate to a repairable unit,
and to indicate the action to be initiated by the
software.

However, there must be sufficient M2 space
available so the system can run without "thrashing". This
entails modifying the work load so as to reduce the
memory required to accommodate the working sets of the
remaining processes. Possibly, the number of processes
of particular types might be limited to reduce the work
load.


7.3.2.3 Proposed Configuration for M2 Failure Recovery: The
proposed configuration defines an M2 module as four M2 units
which are interleaved on their low order address bits (see
Figure 7.2).


Information segments may either be stored in a simplex
or duplex mode. The mode is specified within the descriptor.
Most program code would be stored simplexed and interleaved
across the four memory units. Most critical data segments would
be stored duplexed. In the duplexed storage mode addresses i and
i + 1 contain identical information. That is, two adjacent
memory units contain identical copies of the redundant words.
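
The way interleaving places the two copies of a duplexed word in different memory units can be sketched as follows. The function names are hypothetical, and the requirement that a duplexed segment start on an even address is an assumption of this sketch, not something the text states.

```python
UNITS_PER_MODULE = 4   # an M2 module is four interleaved M2 units


def unit_of(address):
    """The low order address bits select the memory unit."""
    return address % UNITS_PER_MODULE


def duplex_addresses(word_index, base):
    """Addresses of both copies of one word of a duplexed segment.

    Assumes `base` is even, so that each pair (i, i + 1) falls in two
    different, adjacent units of the same module.
    """
    i = base + 2 * word_index
    return i, i + 1


# Simplex words at consecutive addresses rotate through all four units.
print([unit_of(a) for a in range(8, 12)])

# Each duplexed word occupies two addresses in two different units.
for w in range(3):
    a, b = duplex_addresses(w, base=8)
    print(w, (a, unit_of(a)), (b, unit_of(b)))
```

A duplexed segment of n words therefore needs 2n consecutive addresses, which is the double size "hole" mentioned in the text.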

A minimum of two memory ports connect to the redundant 
P interfaces. Communication with any M2 unit can occur through 
either port. This is under control of the command issued from 
the processing units. 

M3 is used to backup most program segments. M2 is used 
as the backup for data and certain critical program segments. 
Program and Data Segments can be stored anywhere in M2. When 
space is assigned to a critical data segment, a double size "hole" 
must be found in M2. This does not impose any extra effort upon 
the memory management function. 

Redundant writes into independent units of M2 are
accomplished automatically via the dual redundant processing unit
bus links. Recovery of M3-backed-up information requires making
the segment "not present". The memory management routine which
handles segment faults will automatically reload the M3 segments
when required, on demand.
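
Recovery by marking a segment "not present" can be sketched with a toy memory manager. All names here (`Descriptor`, `MemoryManager`, `unit_failed`) are hypothetical, and the allocation policy is deliberately trivial; the point is only that the ordinary segment fault path performs the reload.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    name: str
    present: bool = False
    m2_unit: int = -1          # -1 means no M2 space is allocated


class MemoryManager:
    def __init__(self, m3_contents, healthy_units):
        self.m3 = m3_contents              # segment name -> contents in M3
        self.healthy_units = list(healthy_units)
        self.m2 = {}                       # (unit, name) -> contents
        self.reload_count = 0

    def access(self, d):
        if not d.present:
            self._segment_fault(d)         # normal demand loading path
        return self.m2[(d.m2_unit, d.name)]

    def _segment_fault(self, d):
        unit = self.healthy_units[0]       # trivial allocation policy
        self.m2[(unit, d.name)] = self.m3[d.name]
        d.m2_unit, d.present = unit, True
        self.reload_count += 1

    def unit_failed(self, unit, descriptors):
        # Recovery action: segments in the failed unit become "not
        # present". Nothing is copied here; the next access reloads them.
        self.healthy_units.remove(unit)
        for d in descriptors:
            if d.m2_unit == unit:
                d.present, d.m2_unit = False, -1


mm = MemoryManager({"nav": "program text"}, healthy_units=[0, 1])
seg = Descriptor("nav")
mm.access(seg)                 # first use loads the segment into unit 0
mm.unit_failed(0, [seg])
print(mm.access(seg), seg.m2_unit, mm.reload_count)
```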

Whenever an M2 error is detected, the error indications
are communicated to both halves of the processing unit so they
can continue to perform identical operations. The ability to
restart an instruction can be exploited in attaining system
recovery after an M2 failure. As soon as an M2 error is detected,
the processing unit traps to a special micro-routine which
bootstraps into the sequence indicated in Figure 7.3. After
recovery, the instruction which was terminated by the trap can be
re-executed (if the M2 error was detected during φ1 of the
instruction) or the instruction may be completed (if the M2 error
was detected during φ2 of the instruction). It is interesting
to note that M2 read operations occur only during φ1 while M2
write operations occur only during φ2.

Figure 7.2: Interleaved Memory Units

Figure 7.3: M2 Error Indication

When an M2 error indication is first recorded, the M2
operation will be tried again. If the error does not recur
then a type 1 error is indicated. However, if the error
indication persists a search is made to determine which segments
are stored in the suspect unit. To accomplish this search in
the presence of a failed unit, the header word containing a
pointer to the segment descriptor as well as a link to the next
segment is redundantly stored. Figure 7.4 shows the storage
allocation for both simplex and redundantly stored segments.

All non-critical segments within the suspect module
are put into a "dead" state.

Critical segments can be either redundantly stored or 
not. A redundantly stored critical segment is written out to 
M3 so normal memory management can be used to allocate new space 
for it when required. Since it is assumed that failures do not 
simultaneously affect both copies of redundantly stored infor- 
mation, the good copy can be accessed after a failure. 

Non redundantly stored critical segments are made not 
present. Fixed data and programs fall into this category. 

For all M2 failures, statistics are maintained indica- 
ting a failure history. If an M2 module develops a bad history 
of failure, then it will be removed from an active status. The
definition of how many failures within a given time period indi- 
cates a bad history, can be considered a design parameter depend- 
ing upon whether transient or permanent hardware failures con- 
stitute the predominant failure mode. 
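
The handling sequence described above (retry, classify, dispose of the affected segments, update the failure history) can be summarized in a short sketch. The dictionary layout, the `"persistent"` label and the `bad_threshold` default are assumptions made for the illustration; the text names only the type 1 error and leaves the threshold as a design parameter.

```python
from collections import Counter

TRANSIENT = "type 1"        # the error did not recur on retry
PERSISTENT = "persistent"   # hypothetical label for a recurring error


def handle_m2_error(retry_ok, segments, unit, history, bad_threshold=3):
    """Return (error kind, whether the unit should leave active status).

    segments: list of dicts with keys name, unit, critical, redundant,
    state. Only segments stored in the suspect unit are touched.
    """
    history[unit] += 1                      # statistics for all failures
    if retry_ok:
        kind = TRANSIENT
    else:
        kind = PERSISTENT
        for seg in segments:
            if seg["unit"] != unit:
                continue
            if not seg["critical"]:
                seg["state"] = "dead"
            elif seg["redundant"]:
                # The good copy is saved; space is reallocated later.
                seg["state"] = "written to M3"
            else:
                # Fixed data and programs are reloaded on demand.
                seg["state"] = "not present"
    return kind, history[unit] >= bad_threshold


history = Counter()
segments = [
    {"name": "log", "unit": 2, "critical": False, "redundant": False,
     "state": "ok"},
    {"name": "nav data", "unit": 2, "critical": True, "redundant": True,
     "state": "ok"},
]
print(handle_m2_error(False, segments, 2, history))
print([s["state"] for s in segments])
```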


7.3.3 Fault Tolerant Aspects of the I/OC, Channel 

This section will address problems associated with re- 
covery from a transient or permanent failure in the I/OC or com- 
munication channel between the I/OC and the device. 

Many constraints must be placed upon the I/OC, channel 
and the attached devices and controller. Figure 7.5 presents 
schematically the elements which will enter into the discussion. 
Only one I/OC, channel and device is shown. Clearly more exist 
in a real system. Our discussion will focus on only one I/OC, 
channel and device at a time. 


7.3.3.1 Incorrect I/O Commands: The basic recommended approach
is to eliminate the possibility of executing incorrect I/O
commands. As a general principle all I/O devices require some
degree of feedback to the MP, if any fault tolerant design goals
are to be achieved. No external device can be allowed to run in
an open loop mode without communications back to the I/OC.

Figure 7.4: Storage Allocation in Interleaved M2
(M2 Memory Module)

Legend:

MU_k = memory unit k
HW_i = header word of the i-th segment
W_jS_i = j-th word of the i-th segment
HW_i' = redundant copy of HW_i
(W_jS_i)' = redundant copy of W_jS_i
P = pointer to start of next segment, contained in HW_i
P' = redundant pointer, contained in HW_i'

Figure 7.5: I/O Elements

Legend:

P = Processing Unit
M2 = Operating memory
I/OC = Input output controller
Communication channel between I/OC and device
Device and associated controller (if required), e.g.,
printers, CRTs, IMU's, other computers, etc.

E1 = Device error
E2 = Channel error detected by device
E3 = Channel error detected by I/OC
E4 = I/OC error
E5 = M2 error detected by I/OC
E6 = Interprocessor communications error detected by I/OC
E7 = Processing Unit error


One of the most devastating aspects of I/O failure is
the possible execution of illegal, unwanted I/O commands. A major
design effort, which will impose constraints on the elements of
Figure 7.5, must be undertaken to eliminate or minimize the
possibility of incorrect I/O. Let us look at a typical I/O sequence,
with safeguards to minimize this possibility.


a) The processing unit issues an I/O command to the I/OC.

b) The I/OC reads the indicated M2 location to obtain the
I/O descriptor.

c) The I/OC sets up the channel and issues the command to
the device.

d) The device echoes the command back to the I/OC for
verification.

e) If correct, the I/OC issues an execute sequence to the
device. The device then executes the command which may
require reading or writing into M2.

f) After execution, a finished indication is sent from the
device to the I/OC and this status is set into the I/O
descriptor in M2, or an interrupt is generated.

Let us investigate the effect of a failure during any of 
the sequential steps listed above. Error indications can occur 
from many sources including P, M2, I/OC, channel and device. 

A failure indication, E5, E6, or E7 during steps a and
b allows time so the I/OC can prevent the issuance of the command.
If an error, E1, E2, or E3 is detected during step c, then the
I/OC must also terminate the command, since an execute has not
been issued to the device. An E4 failure indication during step
c should result in an emergency sequence to cancel the I/O
request already issued to the device.

The echo check, step d, provides a positive verification 
(feedback) that the device has successfully received the command. 
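
The echo check of steps c) through f) can be sketched as a two-phase exchange in which the device latches a command but does not act until the I/OC has compared the echo. The `Device` class, the `issue_io` function and the way a channel fault is modelled (by corrupting the latched command) are all assumptions of this illustration.

```python
class Device:
    def __init__(self, corrupt_on_receive=False):
        self.corrupt = corrupt_on_receive
        self.latched = None
        self.executed = []

    def receive(self, command):
        # A channel fault is modelled by corrupting the latched command.
        self.latched = command + "?" if self.corrupt else command
        return self.latched                  # step d: echo to the I/OC

    def execute(self):
        self.executed.append(self.latched)   # step e: execute sequence
        return "done"                        # step f: finished indication

    def cancel(self):
        self.latched = None


def issue_io(descriptor, device):
    command = descriptor["command"]          # step b: descriptor from M2
    echo = device.receive(command)           # steps c and d
    if echo != command:
        device.cancel()                      # nothing has been executed
        descriptor["status"] = "aborted"
    else:
        descriptor["status"] = device.execute()
    return descriptor["status"]


print(issue_io({"command": "READ 7", "status": None}, Device()))
print(issue_io({"command": "READ 7", "status": None},
               Device(corrupt_on_receive=True)))
```

Because execution is a separate step, a mismatch found at the echo costs only a cancelled request, never an incorrect action at the device.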

An I/OC or channel failure indication during the
execution of a command must result in a sequence of operations which
is very device dependent. This will be discussed in Section
7.3.3.7.

7.3.3.2 Super Critical Commands: Although the I/O portion of
the space station is inadequately defined, it seems reasonable
to postulate the necessity for a small number of super critical
commands with the following properties.

a) It is most disastrous if the command is executed when
it shouldn't be.

b) It is better to abort the command or action if anything
seems to be going wrong rather than execute it
incorrectly.

Examples of such commands might be "Stage the Rocket",
"Purge the Airlock", etc. What should be done if failure occurs
during the execution of a super critical command? The answer is
to make the command fail safe, by issuing it or a facsimile through
multiple channels to the device. Only when all the arming
conditions for the command are properly set is the device allowed
to execute. If any discrepancy is noted at the device, command
execution must be held up for resolution by the MP.
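
The arming rule for a super critical command can be stated as a small predicate evaluated at the device. This is a minimal sketch: the function name, the channel identifiers and the representation of a command as a string are hypothetical.

```python
def device_may_execute(copies, expected_channels):
    """Decide at the device whether a super critical command is armed.

    copies: mapping of channel id -> command received on that channel.
    The command is armed only if every expected channel delivered a copy
    and all copies are identical. Any discrepancy means the device holds
    the command for resolution by the MP.
    """
    if set(copies) != set(expected_channels):
        return False
    return len(set(copies.values())) == 1


print(device_may_execute({"A": "PURGE", "B": "PURGE"}, ["A", "B"]))
print(device_may_execute({"A": "PURGE", "B": "PURGF"}, ["A", "B"]))
print(device_may_execute({"A": "PURGE"}, ["A", "B"]))
```

The predicate is biased toward not executing, which matches property b): aborting is preferred to incorrect execution.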


7.3.3.3 Interrupts: In many instances, the system is faced with
the problem of "phantom" interrupts or missing interrupts. Fault
conditions within the interrupt logic can cause undesired
interrupts (phantom interrupts) or can possibly prevent the generation
of interrupts which should occur. The action to be taken by the
system in these cases is very dependent upon the interrupt
condition one is considering.

Let us consider two cases: 

a) The Expected Interrupt 

Often interrupts are expected when an I/O device
command finishes. The exact time of occurrence of the
completion of the I/O command is not known, but the
worst case time may be estimated.

A time out error indication is a simple mechanism
which will inform the system that the I/O device has
not finished executing the command or at least the "done"
interrupt has not been received within a given time
period. If the I/OC and channel have a sufficient amount
of internal error detection, the failure can probably
be attributed to the device itself.

The action to take might involve a limited number
of retries of the operation or a call for system
reconfiguration which eliminates the device from use.

If a "phantom" interrupt occurs, which indicates
a device end condition for a device which wasn't being
used, then clearly this interrupt should be ignored by
the system. This feature can be incorporated into the
interrupt handling routines.

b) Unexpected Interrupts 

These are a class of interrupt conditions which 
are provided for but which are unexpected. For example, 
the failure of a P or M2 unit might cause a different P 
to get interrupted. If this failure interrupt is sign- 
alled when the condition really doesn't exist, it is 
probably still wise to service the interrupt rather than 
ignore it. It is better to configure into a degraded 
mode of operation, for a short while, when it isn't nec- 
essary, rather than not to reconfigure when it is nec- 
essary. 

Other interrupts which are unexpected are not as- 
sociated with failures. Many are traps, such as absent 
segment trap conditions. The servicing of an absent 
segment trap condition when one doesn't exist can lead 
to inconsistent situations and ultimately system failure. 

One design feature, which can be applied to certain
I/O interrupts, involves a handshaking or interrupt
verification concept. This feature would have the system
verify that the interrupt which was signalled really 
does exist. The device which signalled the interrupt 
must retain the interrupt condition information until 
after the verification cycle. The verification can 
either be performed directly by the I/O unit or by a 
processor through an I/O command. 
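
The time out, phantom interrupt and verification ideas for expected "done" interrupts can be combined in one small monitor. The class and method names are hypothetical, time is passed in explicitly rather than read from a clock, and the sketch covers only expected interrupts; failure interrupts, which the text says should be serviced even when doubtful, are outside its scope.

```python
class InterruptMonitor:
    def __init__(self):
        self.deadlines = {}   # device -> time by which "done" must arrive

    def start_command(self, device, now, worst_case):
        self.deadlines[device] = now + worst_case

    def on_done_interrupt(self, device, device_confirms=True):
        if device not in self.deadlines:
            return "ignored"          # phantom: the device was not in use
        if not device_confirms:
            return "ignored"          # verification cycle did not confirm
        del self.deadlines[device]
        return "accepted"

    def timed_out(self, now):
        """Devices whose "done" interrupt is overdue (missing interrupt)."""
        return sorted(d for d, t in self.deadlines.items() if now > t)


mon = InterruptMonitor()
mon.start_command("tape", now=0.0, worst_case=5.0)
print(mon.on_done_interrupt("crt"))      # no command outstanding
print(mon.timed_out(now=6.0))            # tape is overdue
print(mon.on_done_interrupt("tape"))
```

The `device_confirms` argument stands for the result of the verification cycle, during which the device must still hold the interrupt condition.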


7.3.3.4 Non-State-Dependent Sequences: If an I/OC or channel
sustains a transient, which causes the termination of an I/O
sequence, then it would be desirable to rely upon a recovery 
policy which would cause the reissuance of the I/O command. In 
order for this recovery policy to be satisfactory, the response 
of the I/O device to the command must be only a function of the 
command and not of the state of the device itself. This feature 
can be designed into the device if one is careful about the ini- 
tial design specification and the type of commands one allows. 

For example: Assume a tape unit is at the end of
record 6 of file 1. A command which says "Read the next record"
is very dependent upon the state of the tape unit; namely the
position of the tape. A better command structure would be "read
record 7 of file 1". The result of this command will always be
the same independent of the position of the tape.

It should be clear that "Read the next record" would not
prove to be a satisfactory command to reissue in case of a
failure in the middle of reading record 7. Record 8 could be
accessed instead of 7.
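
The difference between the two command styles can be shown with a toy tape unit. The `TapeUnit` class is hypothetical, it models a single file, and records are indexed from 0 for simplicity.

```python
class TapeUnit:
    def __init__(self, records):
        self.records = records
        self.position = 0      # index of the next record under the head

    def read_next(self):
        # State dependent: the result depends on where the tape stands.
        data = self.records[self.position]
        self.position += 1
        return data

    def read_record(self, n):
        # State independent: position first, so history does not matter.
        self.position = n
        return self.read_next()


records = ["record %d" % i for i in range(10)]

tape = TapeUnit(records)
tape.position = 7
tape.read_next()               # transfer interrupted, data discarded
print(tape.read_next())        # the reissued command reads the wrong record

tape = TapeUnit(records)
tape.position = 7
tape.read_record(7)            # transfer interrupted, data discarded
print(tape.read_record(7))     # the reissued command reads record 7 again
```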


7.3.3.5 Complete Message Buffer: If errors can be detected as
soon as they occur and if recovery from transient errors is
required, both the I/OC and the device must have enough buffer
storage so that a retransmission of the entire message (data and
command) can be made. The I/OC buffer may, indeed, be M2 and
the buffer storage element of the archival memory might be the
tape itself.

It is undesirable to have to recreate the entire message 
because of a channel transient error. Retransmission appears to 
be a reasonable approach. 


7.3.3.6 Real Time Aspects: When the MP is used as an element
of a real time control loop, outputs can be required periodically.
If a failure occurs during a real time I/O command, the device 
could possibly have to wait for a number of iteration cycles for 
the recovery cycle to be complete. 

In this instance, the device must be provided with a 
capability to extrapolate from old updates until the system has 
recovered. This might require nothing more than assuming the 
last update is still valid. Possibly, more complex methods are 
required. 
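
Two simple extrapolation policies a device might apply while waiting for recovery are sketched below. Both function names are hypothetical, and the linear form is only one example of the "more complex methods" the text mentions.

```python
def hold_last(history):
    """Assume the last update is still valid.

    history: list of (time, value) pairs, oldest first.
    """
    return history[-1][1]


def linear_extrapolate(history, t):
    """Continue the trend of the last two updates out to time t."""
    (t0, v0), (t1, v1) = history[-2], history[-1]
    return v1 + (v1 - v0) * (t - t1) / (t1 - t0)


updates = [(0.0, 10.0), (1.0, 12.0)]   # last two outputs before the failure
print(hold_last(updates))
print(linear_extrapolate(updates, 3.0))
```

Holding the last value needs no arithmetic in the device; a trend-based method needs at least two stored updates and their times.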


7.3.3.7 Failure During the Execution of an I/O Command: If a
transient occurs, the actions to pursue in order to recover
become extremely device dependent.

Consider the following examples: 

a) Many of the external devices attached to the space
station data bus are transducers, to monitor temperature,
pressure, gas mixture, etc. If an I/OC or channel
failure occurs, the appropriate action for recovery would
be to ignore the results of the command in progress,
clear the buffer or reset the device if necessary, and
reissue the command.

Any non-destructive read operation can be reissued
for recovery purposes. Destructive read operations should
be eliminated from the system specification or temporary
redundant storage or redundant devices must be employed.

b) Consider the case of updating the refresh memory of a
CRT output device. Assume a failure occurs during the 
update operation and the possibility of incorrect in- 
formation on the CRT exists. Recovery action can con- 
sist of nothing more than reissuing the update command. 
If recovery takes 100 ms the human operator might only 
notice a small flicker on the screen and no damage is 
done to the overall system. 

c) Consider the case of a printer. Assume a failure occurs 
in the middle of a print cycle. It should be clear that 
the reissuance of the PRINT command is inappropriate for 
recovery since the old printed output, possibly incor- 
rect, would exist immediately on top of the new valid 
printed output. Page boundaries would be incorrect. 
Before reissuance of the print command, the page must
be spaced. If a plotter instead of a printer were being
used, the computer operator would have to be informed 
to insert a new sheet of paper in the plotter. 

d) Inter-Computer Communication. Quite possibly, the space 
station will contain pre-processors in addition to a 
large central multiprocessor. Pre-processors are em- 
ployed so as to buffer the high bit rate of the device. 
(See Figure 7.6.) They perform high frequency inter- 
active calculations and provide a data rate reduction 
for the system. 

Unlike simple input output devices which can re- 
cover with reissuance of commands, a pre-processor re- 
action to a command can be very dependent upon its own 
state. 

All the concepts of command verification and
message buffering must be built into the pre-processor.
The programs in the pre-processor must also be designed
to run asynchronously from the multiprocessor.

7.3.3.8 I/O Locks: When a software process requires access to
an I/O device, the device may be required to be locked to the
process. That is, no other process can access the selected device
until the previous I/O request is finished. Problems of
deadlock exist when the initiating process fails.

If the software process recovers quickly enough, then 
the lock does not remain on the I/O device for an excessive time. 
However, if recovery takes a long time or if the process is
specified to be non-critical (that is, it need not recover), then
some mechanism must be designed into the system to release the
I/O lock. This is one of the elements to consider in allowing
a process to fail safe.

Figure 7.6: Pre-Processors

Even though the process need not continue operation, in
case of a failure, a special I/O routine must be executed to
search, find and release all locks created by the process which
terminated.
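
The lock release routine can be sketched with a table that records the owning process of each locked device. The `IOLockTable` class and its method names are hypothetical.

```python
class IOLockTable:
    def __init__(self):
        self.owner = {}                 # device -> id of the owning process

    def acquire(self, device, pid):
        # Succeeds if the device is free or already held by this process.
        if self.owner.get(device, pid) != pid:
            return False
        self.owner[device] = pid
        return True

    def release_all(self, pid):
        """Search for, find and release every lock held by a process."""
        held = [d for d, p in self.owner.items() if p == pid]
        for d in held:
            del self.owner[d]
        return sorted(held)


locks = IOLockTable()
locks.acquire("printer", pid=1)
locks.acquire("tape", pid=1)
print(locks.acquire("printer", pid=2))   # refused while process 1 holds it
print(locks.release_all(pid=1))          # run when process 1 terminates
print(locks.acquire("printer", pid=2))   # the device is available again
```

Keeping ownership in one table is what makes the search possible after the owning process has gone; a lock recorded only inside the failed process could not be found.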


7.4 The Implications of Fail Safe

Although it is the physical hardware that fails, it is 
conceptually useful to consider the process being executed at the 
time of failure to have failed. Only one process can fail when 
a processor fails. In the case of an M2 module many processes 
can be affected. 

It is assumed that in the space station environment all 
processes are either required to recover or fail safe. None are
allowed to be abruptly terminated without consideration of the 
interaction between the termination and the rest of the system. 

A number of problem areas arise when one considers the 
implications of Fail Safe. Some of these are discussed below: 

a) In order to maintain system throughput in a multipro- 
cessor, the intrinsic parallelism within a function 
must be exploited. Parallel processes are spawned and 
executed simultaneously on different processors. 

If a process is to fail "safe", all the fork points 
which were created must be examined and all the spawned 
processes terminated. This feature must exist within 
the executive function of the system which controls the 
termination of processes.

b) If a process is to fail "safe", all the I/O commands
issued by the process must either be cancelled,
terminated or completed. None may be left indefinitely
on queue. The various commands issued to each device
must be studied to ascertain the effect of a premature
termination of the issuing process. If a tape was in
the middle of reading a record, the read cycle can be
completed. Upon receipt of the "Done" indication, the
read data can be discarded. If a command is still on
an I/O queue, it can be cancelled. If a device is being
written into, it is not clear that the write operation
can continue when the initiating process is terminated.
All these types of questions must be considered for
each I/O device when one desires a process to fail safe.

c) When a memory unit fails and a segment of a non-critical
process is made dead, questions must be raised as to the
disposition of the other valid segments within M2 and
M3 associated with the failed process. The following
suggestion is made:

One of the conditions which will cause a process
to be killed will be when it attempts to access data
contained within a dead segment. When this occurs,
control will be transferred to an executive routine
which will control the operation of systematically
terminating the process. This includes:

1) Placing the process in the dead state 

2) Placing all spawned processes in the dead state 

3) Releasing the stack number (in the case of a stack 
machine) and the space used by the process and de- 
pendent processes. 

Contained within the process stack are descriptors 
of all the local data segments currently being used by 
the process. The space used by these segments must 
eventually be reclaimed for other uses. 

During the normal execution of the memory management
function, any segment not referred to within a period
of time will be replaced by more active segments. This
includes any dead data segments that may exist. Even- 
tually, all the dead segments in M2 will be overwritten 
by just letting the system run normally. However, it 
is possible for dead data segments to occupy space on 
M3 which could possibly be used for other segments or 
for file storage. 

At some point a "Garbage Collection" routine will
have to be executed in order to reclaim this lost space.
Most probably, the normal reclamation of fragmented M3,
due to M3-M4 control, will provide the required service.

In general, the executive design must consider the actions 
to take when a process enters the dead state. If an 
interrupt is directed to a process which is in the dead 
state, it should be ignored and any other process which 
is dependent upon the dead process must be informed so 
that appropriate action can take place. 
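
The executive actions discussed in this section (placing a process and everything it spawned in the dead state, informing dependents, ignoring interrupts directed at dead processes) can be sketched as follows. The `Executive` class and its bookkeeping tables are hypothetical; release of stack numbers and memory space is omitted.

```python
class Executive:
    def __init__(self):
        self.state = {}        # pid -> "running" or "dead"
        self.children = {}     # pid -> pids spawned at its fork points
        self.dependents = {}   # pid -> pids to inform if it dies
        self.notified = []

    def spawn(self, pid, parent=None):
        self.state[pid] = "running"
        self.children[pid] = []
        if parent is not None:
            self.children[parent].append(pid)

    def kill(self, pid):
        # Walk every fork point so that no spawned process survives.
        stack = [pid]
        while stack:
            p = stack.pop()
            if self.state[p] == "dead":
                continue
            self.state[p] = "dead"
            self.notified.extend(self.dependents.get(p, []))
            stack.extend(self.children[p])

    def deliver_interrupt(self, pid):
        return "ignored" if self.state[pid] == "dead" else "delivered"


ex = Executive()
ex.spawn(1)
ex.spawn(2, parent=1)
ex.spawn(3, parent=2)
ex.spawn(4)                    # independent process that relies on 2
ex.dependents[2] = [4]
ex.kill(1)
print(ex.state, ex.notified, ex.deliver_interrupt(3))
```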
Chapter 8

CONCEPT VERIFICATION 


8.1 Background 


The multiprocessor (MP) system proposed for future manned
space stations will employ many new concepts which will
hopefully enhance the performance and reliability of the system.
This chapter will discuss the validation of various concepts
proposed for the space station MP. The concepts to which
reference is made are not applications software or SUMC hardware but
rather those aspects of the system which interact with applications
software and SUMC hardware to control the operation of
space station subsystems and experiments. One wishes to verify
that the ideas which will be implemented do indeed yield the
desired performance with an efficient utilization of resources.


How does one go about validating a new concept, or at 
least establishing confidence that a given approach will prove 
satisfactory? The ultimate answer is to build the system, run 
it, and evaluate its performance. This of course is an expen- 
sive process, especially if many new ideas have to be frozen 
into a design before it is evaluated. In order to provide a 
more orderly, cost effective approach a two level simulation is 
proposed, both levels being carried out before the system is 
committed to operational use. 


This chapter will discuss both a high-level and a more 
detailed low-level concept verification process. 

a) The first verification phase involves both analytical
techniques as well as a high-level computer simulation 
employing idealized work loads and environments. The 
results of this effort will verify that a given design 
concept can achieve specified qualitative goals. 

b) The second phase involves a more detailed, low-level 
simulation requiring both simulated and actual hardware 
and software modules. The objective of this phase is 
to verify quantitative goals, by means of measurements 
and design modification.

As part of both verification steps, measurements are
made and design parameters are modified so as to optimize sys- 
tem performance. The specific activities involved in the de- 
sign verification and performance optimization of the space 
station multiprocessor concept will be presented in the remain- 
der of this section. 


8.2 Phase 1 -- Initial Analysis and High-Level Simulation

8.2.1 Objectives 


The initial analysis and high-level simulation attempt 
to achieve the following objectives: 


8.2.1.1 Design Features: The major design features must be
established. In an MP system this will include:

a) A definition of memory management philosophy 

b) The appropriate utilization of local memory 

c) Interrupt and I/O analysis 

d) The structure of the MP internal bus 

For example, the application of simple analytical techniques 
will demonstrate the inappropriateness, from a performance stand- 
point, of a single 32 MBPS internal bus which is time-shared be- 
tween P's and M2 elements. 
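
The kind of simple analytical check meant here might be sketched as follows. Only the 32 MBPS bus rate comes from the text; the word length, the per-processor reference rate, and the function name are hypothetical placeholders chosen for illustration.

```python
# Back-of-envelope check of a single time-shared P-M2 bus.
# Only the 32 Mbit/s bus rate is taken from the text; all other
# numbers are hypothetical placeholders.

def bus_utilization(n_processors, refs_per_sec, word_bits, bus_bits_per_sec):
    """Fraction of the bus capacity demanded by all processors together."""
    demand = n_processors * refs_per_sec * word_bits
    return demand / bus_bits_per_sec

BUS_RATE = 32e6          # bits per second (from the text)
WORD_BITS = 32           # assumed word length
REFS_PER_SEC = 500_000   # assumed M2 references per processor per second

for n in (1, 2, 3, 4):
    u = bus_utilization(n, REFS_PER_SEC, WORD_BITS, BUS_RATE)
    status = "saturated" if u >= 1.0 else "ok"
    print(f"{n} processors: utilization {u:.2f} ({status})")
```

Under these assumed figures the bus saturates as soon as a second processor is added, which is the sort of result that would rule out the single shared bus before any detailed simulation is attempted.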


8.2.1.2 Parameters: The parameters which should be made
variable in the low-level simulation must be identified and
segregated, so that performance can be optimized. For example,
the simple analysis of local memory and its effect upon
performance indicates that the major parameters are the M2/M1
speed ratio, r, and the hit ratio, h.

The isolation of these parameters is significant in that
performance improvement or degradation is very sensitive to h.
Clearly, those hardware and software elements which control h
should be made as variable and flexible as possible.
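
The sensitivity to h can be illustrated with a simplified two-level memory model. This is a sketch under stated assumptions, not the analysis used in the study: a hit is assumed to cost one M1 access, a miss one M2 access, and the function names are illustrative.

```python
# Simplified two-level memory model (an assumption, not the study's model):
# a hit costs one M1 access, a miss costs one M2 access (r times slower).

def relative_access_time(h, r):
    """Mean access time, in units of the M1 access time.

    h: hit ratio (fraction of references satisfied by M1)
    r: M2/M1 access time ratio
    """
    return h + (1.0 - h) * r

def speedup_over_m2_only(h, r):
    """Gain relative to a system in which every reference goes to M2."""
    return r / relative_access_time(h, r)

for h in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"h = {h:.2f}  speedup = {speedup_over_m2_only(h, r=10.0):.2f}")
```

With r fixed, most of the achievable gain appears only as h approaches 1, which is why the elements that control h deserve to be kept adjustable.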

If a virtual memory is employed, a simple, high-level
simulation or analysis will show that the following parameters
should be made variable:

a) The page size (if paging is employed).

b) The replacement algorithm.

c) If an associative memory is proposed, its size should 
be variable. Performance is very sensitive to the 
search time of the page location algorithm. 

d) Possibly, the utilization of a variety of different 
access times to M3 devices should be considered. 

e) The size of M2 could be a parameter. The "thrashing" 
threshold has to be established if software expanda- 
bility is to be achieved. 
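
A high-level experiment of the kind suggested by items (b) and (e) can be sketched in a few lines. The least-recently-used policy and the reference string below are assumptions chosen for illustration; the report does not prescribe a replacement algorithm.

```python
from collections import OrderedDict

def lru_faults(reference_string, n_frames):
    """Count page faults for an LRU replacement policy with n_frames frames."""
    frames = OrderedDict()          # keys are resident pages, oldest first
    faults = 0
    for page in reference_string:
        if page in frames:
            frames.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(frames) >= n_frames:
                frames.popitem(last=False)  # evict least recently used page
            frames[page] = True
    return faults

# Hypothetical page reference string, for illustration only.
refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
for n in (3, 4, 5):
    print(f"{n} frames: {lru_faults(refs, n)} faults")
```

Sweeping the number of frames (a stand-in for the size of M2) against a representative reference string shows where the fault count begins to rise sharply, which is one way to locate a thrashing threshold.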

The main objective of this effort is to isolate as
many parameters of design as possible through a careful
scrutiny of all major design features.

8.2.1.3 Assumptions: Another objective of this first phase
effort is to establish clearly all the assumptions, implicit
or explicit, that formed the basis of major design decisions.
For example, why was a multiprocessor chosen? Three answers
are possible:

a) A cost effective performance increase.

b) Reliability improvement through the use of identical
elements and an ability to recover.

c) Expandability.

All three of these assumptions or desires drive one
to the conclusion that the executive system, which interfaces
the hardware and applications software, must be generalized
enough for expandability, yet it must be implemented in such
a way as not to produce an excessive overhead. Reliability
implies a comprehensive error detection scheme. Recovery
implies a specific communication interface between the hardware
and executive.


8.2.2 Tools for High-Level Simulation 


8.2.2.1 Simulation in General: How does one approach the
problem of developing a high level simulation? What tools are
available? Reference 1 discusses techniques available for both
macro (high level) simulation and micro (detailed low level)
simulation. Macro level simulation is concerned with abstractions
of computer systems which are designed to expose and analyze
critical design parameters. Generally speaking, these simulation
techniques deliberately suppress design detail, and concentrate
on broadly defined measures of system effectiveness.

Computer simulation at this level has its basis in
queueing theory, the probabilistic analysis of the interaction
between users and facilities. The role of simulation is to
exercise, by Monte Carlo methods, user and facility interactions
whose complexity exceeds the bounds of known or feasible
analytic solutions.
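
As an illustration of the relationship between the two approaches, the sketch below estimates the mean waiting time at a single facility by Monte Carlo methods and compares it with the known queueing-theory result. The single-server model with exponential arrivals and service is an assumption made so that an analytic answer exists for comparison; the rates and function names are illustrative.

```python
import random

def simulate_mean_wait(arrival_rate, service_rate, n_jobs, seed=1):
    """Monte Carlo estimate of the mean time a job waits before service.

    Uses the recursion: wait(next) = max(0, wait + service - gap),
    where gap is the time until the next arrival.
    """
    rng = random.Random(seed)
    wait = 0.0
    total = 0.0
    for _ in range(n_jobs):
        total += wait
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - gap)
    return total / n_jobs

def analytic_mean_wait(arrival_rate, service_rate):
    """Queueing-theory result for the same single-server model."""
    rho = arrival_rate / service_rate       # facility utilization
    return rho / (service_rate - arrival_rate)

print("simulated:", simulate_mean_wait(0.5, 1.0, 200_000))
print("analytic: ", analytic_mean_wait(0.5, 1.0))
```

Agreement on a case with a known solution builds confidence in the simulator before it is applied to interactions for which no analytic solution is available.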

Digital computer facilities have long exhibited the
symptoms dear to the queueing analyst: namely, bottlenecks.
The reader will probably have personal familiarity with
situations where a data processing facility has become hopelessly
inefficient due to one, or a combination of, bottleneck elements.

The objective of high-level simulation is to obtain 
an advance estimate of the performance of a computing facility 
at the design stage. To be successful, the simulation must 
anticipate the way the system would work if it were built. The 
successful simulation designer must accomplish all of the fol- 
lowing steps: 

a) He must satisfy himself that simulation is an appropri- 
ate analytical method, and that the elements of the 
system and the job stream are sufficiently defined. 

b) He must verify that the results of the simulation are 
correct, and that they are appropriate to his purpose. 

c) He must explain and substantiate his results and pro- 
selytize his conclusions in order to affect future 
events in a constructive way. 

These generalizations are noted here because there 
seems to be an uneasiness among professional personnel about 
high-level simulation of computers. This is probably because 
the technique of simulation has been often misused, particularly 
by neglecting the fundamentals listed above. 


8.2.2.2 GPSS: A generalized macro simulation language, GPSS, was
developed by Gordon [2] of IBM. GPSS deals in transactions,
events, facilities, storages, and queues. A transaction is
generated for each element in the job stream. Events mark the
movement of the transaction through the system of facilities,
storages and queues. A facility is a system element that can
accommodate only one transaction at a time. A storage is a
system element that can accommodate many transactions at a time,
up to a specified limit. A queue is a waiting line. Gordon
gives examples of these concepts as they might occur in
different systems:


Type of System      Transaction    Facility      Storage

Communications      Message        Switch        Trunk Lines
Transportation      Car            Toll Booth    Road
Data Processing     Record         Key Punch     Memory
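
The vocabulary above can be made concrete with a small sketch. This is ordinary Python written for illustration, not GPSS syntax; the class and method names are hypothetical.

```python
from collections import deque

class Storage:
    """Element that holds up to `capacity` transactions; others wait in a queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = []          # transactions currently accommodated
        self.queue = deque()      # waiting line

    def enter(self, transaction):
        """Admit the transaction if there is room; otherwise queue it."""
        if len(self.in_use) < self.capacity:
            self.in_use.append(transaction)
            return True
        self.queue.append(transaction)
        return False

    def leave(self, transaction):
        """Release the transaction and admit the head of the queue, if any."""
        self.in_use.remove(transaction)
        if self.queue:
            self.in_use.append(self.queue.popleft())

class Facility(Storage):
    """A facility accommodates only one transaction at a time."""

    def __init__(self):
        super().__init__(capacity=1)

toll_booth = Facility()
print(toll_booth.enter("car 1"))   # admitted
print(toll_booth.enter("car 2"))   # must wait in the queue
toll_booth.leave("car 1")
print(toll_booth.in_use)
```

Treating a facility as a storage of capacity one is a modeling convenience in this sketch; a full simulator would also attach event times to each movement.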


There have been at least two efforts to develop specialized
simulation languages for computer systems. These languages
are CSS II [3] and IMSIM [4].


8.2.2.3 CSS II: This simulator was developed by IBM to support
its own system analysis needs, and to aid in analysis of
customer facility requirements.

IBM now provides CSS II as proprietary software on a
rental basis. CSS II is similar in concept to GPSS but differs
in one important aspect: it is not general but applies
specifically to computer systems. Thus its language speaks in terms
of tape units, disk files, communication lines, and terminals,
and provides instructions for the modeling of programming systems.

CSS programming consists of a specification of system
elements, a specification that generates job streams, and a
specification of the logical operations to be performed on the job
elements. Its generality is enhanced by permitting a more or
less complete construction of both the system hardware
configuration and the software operating system, to a level
dependent on the user's needs and interests.


8.2.2.4 IMSIM: IMSIM was developed by Systems Development
Corporation for the NASA Manned Spacecraft Center. It presents a
less general approach to computer simulation, in comparison to
CSS, because user constructions are confined to the preparation
of input tables which define the configuration of computer system
elements and the job stream. The algorithms that define the
software operating system cannot be modified, except for a few
switch setting choices. The operating system programmed into
IMSIM includes the capability of simulating priority-dependent
multiprogrammed and multiprocessor computing systems. IMSIM
is supported only at the Manned Spacecraft Center, NASA. It
is written in Modlit, a language similar in many respects to
GPSS.


8.3 Phase 2 -- Low-Level, Detailed, Mixed Simulation

The attractiveness of high-level simulation lies in 
its ability to discover major conceptual flaws before the de- 
sign is committed to hardware and before the operating system 
software is frozen. Hopefully, this effort also builds confi- 
dence in the system concepts at a low cost. The major short- 
coming of high-level simulation is that design flaws may have 
been obscured due to simplifications in the models employed. 

The low-level simulation, employing various degrees
of real hardware, software and a simulated environment, will
provide a more definitive verification of system performance,
albeit at a significantly higher cost. The CVT program presently
being carried out at MSFC is an example of a simulation
with a real computer and data bus. The space station environment
and typical work loads will, however, have to be simulated
by artificial means.


8.3.1 The Simulation Process 

The simulator is a device (both hardware and software)
which provides the developer of the system with overall external
control of the system being tested. The simulator provides
hardware and software required for specifying, monitoring, and
testing the system under well controlled conditions. Reference
1 describes the simulation process, which can be organized into
four factors as shown in Figure 8.1. These are:

a) the user (USER),

b) the simulator itself (SIMULATOR),

c) the computer system being simulated (SYSTEM), and

d) the simulation output (OUTPUT).

Let it be made clear that the SYSTEM being simulated
may be implemented as either a complete software effort on a
host computer or it may contain certain elements of real hardware
and software. There are advantages and disadvantages to both
approaches. These shall be made clear in the discussions to
follow.


The geometry of the logical partitions in the simulator
is shown in Figure 8.2, and the physical control is shown
in Figure 8.3 following. The control path labeled A in the
two figures provides the user with the capability of specifying
the load module to be simulated, start location and initial
SIMULATOR clock setting, the maximum allowable SIMULATOR clock
setting (to assure run termination), the configuration of the
SYSTEM (levels of redundancy, numbers of spares, initial fault
states, etc.), information relative to automatic reconfiguration,
illegal instruction detection, execution of instructions in
read/write memory, etc.

The primary control, which the USER specifies, follows
path B. By this path, and the return path C, the USER will be
capable of ordering entry to routines which he provides, upon
the occurrence of events or situations he specifies. The
trigger directives can include time conditions, location
reference (instruction or operand access), and state changes (I/O,
interrupt, hardware error detection signals, etc.). Once his
routines have been entered as a consequence of a trigger
directive, the USER is capable of accessing all locations, registers,
states, and conditions in the SYSTEM, and modifying them as he
sees fit. Through an interface language, the USER may implement
actions based upon conditions of almost arbitrary complexity, by
simply programming the testing of these conditions in his
routines.
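
The trigger-directive mechanism described above might be sketched as follows. The class, the event dictionaries, and the state layout are all hypothetical; the report does not define the interface language, so this only illustrates the pattern of user routines entered on user-specified conditions.

```python
class Simulator:
    """Toy stand-in for the SIMULATOR: runs user routines when triggers fire."""

    def __init__(self, system_state):
        self.state = system_state   # dict standing in for SYSTEM registers/memory
        self.clock = 0
        self.triggers = []          # list of (condition, routine) pairs

    def on(self, condition, routine):
        """Register a trigger directive (path B in the text)."""
        self.triggers.append((condition, routine))

    def step(self, event):
        """Advance one simulated step and enter any routine whose trigger fires."""
        self.clock += 1
        for condition, routine in self.triggers:
            if condition(self, event):
                routine(self, event)    # routine may read or modify self.state

sim = Simulator({"acc": 0, "fault": False})
hits = []
# Location-reference trigger: record the clock when address 0x40 is accessed.
sim.on(lambda s, e: e.get("addr") == 0x40, lambda s, e: hits.append(s.clock))
# Time trigger: inject a fault into the SYSTEM state at clock 3.
sim.on(lambda s, e: s.clock == 3, lambda s, e: s.state.update(fault=True))

for addr in (0x10, 0x40, 0x44):
    sim.step({"kind": "fetch", "addr": addr})
print(hits, sim.state)
```

Because a condition is simply a routine written by the user, conditions of arbitrary complexity can be tested without any change to the simulator itself.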


Control paths D and D' provide information for OUTPUT, 
such as trace, flow-trace (output produced by branches only) , 
interrupt-occurrences, faults, or output directly from the USER. 

Information is not required on path "a" since the USER 
only interacts with the SYSTEM once the run starts and needs no 
interaction with the SIMULATOR. Figure 8.3 shows that the SYS- 
TEM is actually implemented within the SIMULATOR, and that the 
control paths to it actually interact via the SIMULATOR. 

Path E of Figure 8.3 represents the closed-loop dynamic
flow capability which the USER can exercise within his
interface-language routines. These routines may, in turn, call routines
prepared in other languages to perform further processing.
Using external routines via this path allows the convenient
addition of a data-recording capability to the system to allow
post-run processing and the addition of almost any conceivable
environmental model.

Figure 8.3: Simulator Physical Control Flow


8.3.2 Simulator Design Issues


8.3.2.1 User Interface: For any simulation effort to be successful,
the user or "experimenter" must be provided with the capability
of exercising complete control over the simulation from beginning
to end. This control includes the ability to:

a) Specify all initial conditions, including default conditions,
before the simulation is run. This includes the ability to
specify the contents of memory locations, control bits and
processor registers.

b) Specify the work load to be run in the system, including
hardware elements to be used.

c) Specify the environment to be simulated, including
extraordinary events such as failures.

d) Specify the outputs to be generated and reported.

e) Specify modes of operation, the ability to roll back,
and snapshot times.


8.3.2.2 Work Load: The simulation of the processing unit or
employment of real hardware is only the first step in the
simulation of a computer system. In order to provide meaningful
information on complex system interaction a "work load" for the
system must be specified. For the SUMC MP this will include a
reasonably complete set of actual or simulated applications
software modules as well as the real executive system.

If one attempts to evade the issue of generating a 
realistic work load, many important design factors may be over- 
looked. For example, if a simulated work load is generated by 
a collection of subroutines, each one occupying a given amount 
of memory space, and a specified execution time (as simulated 
by a countdown loop) , the information concerning instruction 
frequency is lost. Also, since memory requirements for each 
subroutine are assigned arbitrary values, many factors con- 
cerning memory management become distorted. 


It is suggested that an effort be made to generate 
the real application software to be used as the work load. 
Space qualified software is not required for a system simula- 
tion. Therefore, the use of real applications software, to 
the extent permitted by the simulator's limitations, may be 
less difficult than trying to generate a realistic model of 
the work load. 

Because of the interaction between hardware and the
executive, it seems only reasonable that the executive system
model must contain as many as possible of the features of the
real executive. A large number of the parameters or algorithms
which will be modified because of the simulation experience are
implemented in the executive software.


8.3.2.3 The Environment: In simulating aerospace computer
systems, the work load must often interact with the spacecraft
and its environment. For example, navigation programs must
receive accelerometer inputs before they can correctly update
vehicle position and velocity. A high degree of similarity must
be maintained between the real and modeled environments so that
the simulated computer can be subjected to computational loads
and dynamic situations closely approximating the conditions of
the actual mission.

The simulation environment developed for the Apollo 
Guidance, Navigation and Control System included modeling space- 
craft dynamics, engines, optics, astronaut interactions, atmos- 
pheric and gravity effects, motions of celestial bodies, etc. 

For the SUMC MP, the environment cannot be simulated
within the SUMC itself. This would distort memory management,
I/O, processor allocation and real time factors. The simulated
environment must be provided by external equipment. For example,
the H316 computer can provide such a vehicle by simulating the
data bus and all its peripherals. If a real data bus is employed
with a limited amount of real avionics equipment then the H316
could be interfaced to the data bus to simulate those equipments
which are impossible to exercise satisfactorily in the laboratory
(e.g., IMUs, fault detectors within BITE).


8.3.2.4 Measurement of the System Under Test: The accumulation
of statistics and the output presentation of this data are the
ultimate result of any simulation effort. If a real computer
is used instead of a simulated model then a major problem can
arise due to the lack of computer memory capacity for trace and
dump routines and data. If the memory is used for trace and
dump data then it cannot be used to process the workload. The
results of the simulation run will therefore be distorted.

A secondary problem also arises in that real time aerospace
computers usually do not possess a full complement of high
speed recording equipment, such as card readers, high
speed printers, or tape units. The attachment of these equipments
could also distort the results since they put an abnormal
load on the I/O.

A complete software simulated system will not suffer
the problems mentioned above since time and memory space are
also simulated entities. When real hardware is used within the
simulator, it is difficult to compensate for inadequate memory
or the loss of real time.

Three features which were incorporated into the Apollo 
computer simulator are presented below as examples of the inter- 
action of the simulator and the simulated system. These inter- 
actions imply that if a real SUMC MP is to be employed as an 
element of the simulated system, a design effort must be under- 
taken to provide the correct "hooks" into the hardware so that 
useful results may be obtained. 

a) Rollback

A useful feature to be used in microsimulation is
rollback [5]. Long missions such as Apollo require
simulation time on the order of hours. Should the host
computer (on which the simulation is being executed)
malfunction, the simulation will abnormally terminate.
Upon restart one does not want to go back and duplicate
the execution of this simulation from the beginning of
flight. By establishing rollback points in the simulation
this problem is avoided. At rollback times complete
core and register dumps are taken, and this information
is put on a secondary storage device. Then upon
system failure the simulation can be restarted at the
last rollback point by loading memory with this stored
information. The overhead associated with rollback is
well justified with long simulations, such as Apollo.
However, to prevent this overhead from becoming too
high the system designer must decide upon a judicious
criterion for establishing rollback points. That is,
he must trade off the cost of frequently storing rollback
information against the savings in not having to
resimulate a large part of the flight.
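
The rollback mechanism can be sketched as a periodic snapshot of the simulated state. This is a minimal illustration under stated assumptions: the class and its names are hypothetical, the state is a plain dictionary, and the snapshot is held in memory, whereas the text describes dumps written to a secondary storage device.

```python
import copy

class RollbackSimulation:
    """Takes a full snapshot of the simulated state every `interval` steps."""

    def __init__(self, state, interval):
        self.state = state
        self.interval = interval
        self.step_count = 0
        self.checkpoint = (0, copy.deepcopy(state))   # initial rollback point

    def step(self, update):
        """Apply one simulation step, then snapshot if a rollback point is due."""
        update(self.state)
        self.step_count += 1
        if self.step_count % self.interval == 0:
            self.checkpoint = (self.step_count, copy.deepcopy(self.state))

    def restart(self):
        """Restore the last rollback point; returns the step resumed from."""
        self.step_count, saved = self.checkpoint
        self.state = copy.deepcopy(saved)
        return self.step_count

def tick(state):
    state["t"] += 1

sim = RollbackSimulation({"t": 0}, interval=10)
for _ in range(25):
    sim.step(tick)
print("resumed at step", sim.restart(), "with state", sim.state)
```

The `interval` parameter is the trade-off named in the text: a small interval raises snapshot overhead, while a large one raises the amount of flight that must be resimulated after a failure.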

b) Stress Testing 

Stress testing can be provided in a simulator to
help determine if combinations of application programs
will exceed their combined time budgets under the expected
conditions of operation. This request reduces
the speed of the object computer. If a group of application
programs is run in a simulation with a computer
whose speed is, say, 75% of the real computer capability,
successful operation may be interpreted to mean
that no more than three-fourths of the computer capacity
has been absorbed. This special request can thus
be used to "diminish" the capability of the computer
until a point is reached where timing requirements
are not satisfied. This level then is a guide to the
amount of computer capability still available for
other software.

Stress testing can also be used to verify the 
"thrashing" threshold of memory management. If the 
amount of available memory is reduced but the workload 
and environment are held constant then a measure can 
be obtained as to how much excess memory is available 
for multi-programming. 
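
The speed-reduction form of stress testing can be sketched as a search for the slowest computer on which the timing requirement still holds. The timing model (programs run back to back within one fixed frame) and all numbers are assumptions made for illustration only.

```python
def meets_deadline(exec_times_ms, frame_ms, speed_fraction):
    """True if all programs finish within the frame on a slowed computer."""
    return sum(exec_times_ms) / speed_fraction <= frame_ms

def capacity_used(exec_times_ms, frame_ms, steps=100):
    """Lowest speed fraction (in 1/steps increments) that still meets timing.

    Returns None if the deadline is missed even at full speed.
    """
    for k in range(1, steps + 1):
        fraction = k / steps
        if meets_deadline(exec_times_ms, frame_ms, fraction):
            return fraction
    return None

# Hypothetical work load: three programs sharing a 60 ms frame.
used = capacity_used([10, 20, 15], frame_ms=60)
print("capacity absorbed:", used, " margin remaining:", 1 - used)
```

In an actual simulator the pass/fail decision would come from running the work load at each reduced speed rather than from a closed-form sum, but the search over the "diminished" capability proceeds in the same way.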

c) The Coroner Request 

A "coroner" special request can be implemented in 
a simulator for post-mortem diagnosis. The request 
causes storage of information from each simulated in- 
struction in a circular buffer of size n. If the run 
abnormally terminates, a list of the last n instructions 
simulated is produced. This list is a valuable aid in 
determining the reason for the abnormal termination. 
However, the overhead associated with this request re- 
quires that it only be used when its cost is outweighed 
by the enhancement of debugging efficiency. 
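
The circular buffer behind the coroner request can be sketched with a bounded double-ended queue. The class name and the (address, instruction) record format are illustrative assumptions.

```python
from collections import deque

class Coroner:
    """Keeps the last n simulated instructions for post-mortem diagnosis."""

    def __init__(self, n):
        self.trace = deque(maxlen=n)    # oldest entries are discarded automatically

    def record(self, address, instruction):
        """Called once per simulated instruction."""
        self.trace.append((address, instruction))

    def report(self):
        """List of the last n instructions, oldest first."""
        return list(self.trace)

coroner = Coroner(n=3)
for address in range(5):
    coroner.record(address, f"OP{address}")
print(coroner.report())
```

The per-instruction call to `record` is the overhead referred to in the text, which is why the request would be enabled only for runs where debugging is expected.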

d) Knobs and Dials 

A system simulation is undertaken not only to 
verify specific design concepts, but also to make per- 
formance measurements under various parametric condi- 
tions. In order to achieve this objective the system 
(hardware and software) must be provided with enough 
flexibility (knobs and dials) so that the various de- 
sign parameters may be adjusted. 

Although the details of the SUMC MP have not been 
published by MSFC, a number of suggestions can be made
concerning those entities which should remain as vari- 
able parameters during the simulation. Implicit in 
the following listing are obviously a number of assump- 
tions which, if incorrect, could make the variable un- 
necessary. For example, if a management directive ex- 
ists that only two processing units are to be employed 
with no concern for future expansion then a number of 
problem areas associated with multiprocessor design 
degenerate into trivial solutions. 

The following list describes some of the design 
parameters which should be kept variable during the 
low level simulation process. 

1) Operating Memory (M2)

Assume a paged virtual memory concept is employed. The
following items should be adaptable in order to optimize
performance.

i) Page size 

ii) Page replacement algorithm 

iii) Page presence algorithm. If an associative
memory is employed to determine the presence
of a page in M2, then the number of words in
the associative memory should be made a
parameter.

iv) Total size of M2 storage as well as the number 
of M2 modules. 

v) Possibly the speed ratio between M2 and M3. 

2) Processing Unit and Local Storage (M1)

i) Instruction architecture. A measure of in- 
struction frequency will indicate which in- 
structions are not needed. Similarly the 
measurement of subroutine usage of various 
control features will indicate which instruc- 
tions need to be incorporated into the de- 
sign. 

ii) Depending upon the use of M1, its size should
be variable.

iii) The algorithm used to assign processes to 
processors should remain a variable as should 
most of the executive functions dealing with 
resource allocation. 

3) Communication 

i) The P-M2 internal bus width and rate should 
be changeable especially if a bottleneck is 
anticipated, based upon phase 1 simulation. 

ii) The communications link from processor to 
processor as well as from processor to I/O 
should be made flexible so that the traffic 
capacity can be increased if a bottleneck 
is discovered. 
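
One way to keep such "knobs and dials" adjustable in a software simulator is to gather them in a single parameter record and generate runs from it. The field names follow the list above, but every default value shown is a hypothetical placeholder, not a SUMC MP figure.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class SimParameters:
    """Design parameters kept variable; all defaults are placeholders."""
    page_size_words: int = 256
    replacement: str = "LRU"
    associative_words: int = 8
    m2_modules: int = 4
    n_processors: int = 2
    bus_width_bits: int = 32

def sweep(base, **choices):
    """Yield one parameter set for every combination of the given choices."""
    names = list(choices)
    for values in product(*(choices[name] for name in names)):
        yield replace(base, **dict(zip(names, values)))

for params in sweep(SimParameters(),
                    page_size_words=[128, 256],
                    n_processors=[2, 3, 4]):
    print(params)
```

Each generated record would drive one simulation run, so that performance can be tabulated against the parameters being varied.
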


Chapter 9

CRITIQUE OF SUMC's ARCHITECTURAL CHARACTERISTICS 


9.1 Design Goals

A critical evaluation of the SUMC design is provided
in order that future efforts may have the benefit of the present
[illegible]. This critique will not be primarily directed at
the implementation aspects of the circuit and/or logic design,
but rather at the higher level architectural features of the
hardware. An evaluation of any design must of necessity rest
with a determination of how well the design approaches a set
of goals. Therefore, a set of design goals is now presented
which is Intermetrics' interpretation of MSFC's desires in the
development of the SUMC project.

a) The MSFC desire to use a basic SUMC hardware design on
a wide variety of missions, which will require a wide
range of computation power, leads to the requirement
for a hardware design which is expandable. "Expandability"
should be considered with respect to such features
as word length and sizes of the various memory and
processing structures, including the micro memory, scratch
pad, ALU, multiplexers and main memory.

b) The variety of application requirements leads to a
desire to create an architecture which is flexible and
adaptable to changing conditions. For example, the
instruction set should be able to be modified or
changed. Similarly, components should be able to be
utilized within the same architectural structure
regardless of their execution speed. As various
technologies improve, this then allows the smaller and
faster logic and/or memory elements to be incorporated
into the design with a minimal impact.

c) A specific requirement of the SUMC expandability and
adaptability design is the ability to utilize the
design as either a stand-alone uniprocessor or as a
larger multiprocessor system.

d) The "U" in SUMC stands for "ultra" reliability. This
must not only include the ability to operate for a long
period of time without failure, but also (from a
practical point of view with respect to current
technology), must indicate the ability to detect failures.
The detection of failures is required if a multiprocessor
is to constrain error propagation and possess
the ability to reconfigure.

e) Since the SUMC family of computers is meant primarily
for aerospace applications, the conservation of weight
and power becomes of primary importance.

Keeping in mind these different criteria, the following
sections will examine various aspects of the SUMC design.
Not all of the aspects are independent of each other, but they
are presented in such a manner so as to highlight different
points of view.


9.2 Micro Instruction Sequencing

In a microprogrammed machine where flexibility is one
of the objectives, it is extremely important that the micro
sequence control itself be flexible. The sequencing control
presently available in SUMC is described in Figure 9.1. The
only control actions possible are

a) stepping thru the micro code (0., 1., 2., 3.)

b) branching to a location

1) described by an ALU output (4.)

2) associated with an opcode (5.)

3) given in the micro code (13.)

c) alternate choice in either

1) branching or holding (6., 7.) or,

2) branching or stepping (8., 9., 10., 11., 12., 14.,


Although these forms of sequencing do allow the generation
of a static set of linked micro code, they do not allow for
easy modularization of micro code.

While this feature becomes particularly important when 
the instruction architecture contains powerful semantically 
concise operations, it is also extremely important with stand- 
ard current forms of instructions. The execution of an 

CONDITIONS    SEQ. ACTION           I.C. ACTION            Binary Code

None          +1                    Hold                   0000
None          +1                    IR (26-31) -> IC       0001
None          +1                    MROM (C11-C16) -> IC   0010
None          +1                    PRM (26-31) -> IC      0011
None          PRM (22-31) -> SEQ    Hold                   0100
None          IAROM -> SEQ          Hold                   0101
IC > 0        Hold                  IC - 1 -> IC           0110
IC = 0        MROM (C7-C16)         Hold
IC > 4        Hold                  -4                     0111
IC < 4        MROM (C7-C16)         Hold
CNT = 1       MROM (C7-C16)         Hold                   1000
CNT = 0       +1                    Hold

CNT -- LALU overflow or ALU overflow or DEX3 as specified by the ACCS, CNT field

ACCS = 1      +1                    Hold                   1100
ACCS = 0      MROM (C7-C16)         Hold

ACCS -- PRM sign or ER sign as specified by the ACCS, CNT field

None          MROM (C7-C16)         Hold                   1101
IC > 0        +1                    -1                     1110
IC = 0        MROM (C7-C16)         Hold
IC > 4        +1                    -4                     1111
IC < 4        MROM (C7-C16)         Hold

Figure 9.1: Control Conditions and Actions

instruction can be viewed as occurring in three phases:
instruction fetch, operator decode, and execution of the operation.

The SUMC allows for common manipulation of all instruc- 
tions in both the instruction fetch and operator decode phases 
of execution. It is interesting to note that after the instruc- 
tion has been fetched, the memory operand is fetched, if the 
operator is of the appropriate "class" of instructions. This 
differentiation is performed by the hardware and is completely 
dependent, therefore, upon both the instruction architecture and 
its physical bit mapping. There is no general way to have sev- 
eral classes of instructions, each with its own idiosyncrasies, 
without this special hardware help. This is because the decision 
on whether or not to read memory must be performed in the "com- 
mon" section of code. 

If there were to be the ability to call and link in the 
micro code, then the question as to whether to read an operand 
from memory could be decided after the operator had been decoded 
and the execution of the operation had been entered. 

(The one current possibility for modularization within
the SUMC micro code would be:

a) Place the return micro address in the PRM

b) Branch to the micro sub-routine

c) Upon entering the subroutine, save the return address
in the SPM

d) To return, gate the return address from SPM to the PRM
and into the SEQ.

This would effectively take four micro words.)

Besides the desire for micro code modularization for 
complex instruction sets, the next section will point out the 
need to be able to do much more micro condition testing for 
sequence control.


9.3 Choosing Functions to Optimize

It has been observed that the SUMC hardware has been
optimized for the implementation of the multiply, divide and
square-root operations. However, what is the actual expected
percentage of occurrence of these operations? In particular,
what is the frequency distribution of all the implemented
machine instructions?

C.C. Foster et al. [1] have made a study of OP code usages
on the CDC 3600 and have found that in scientific Fortran programs,
the compiled code contained only 10% arithmetic instructions. The
remaining instructions were involved with load, store, subroutine
linking and various other control operations. For the more
commercial type of application, the total percentage of all
arithmetic instructions fell to less than 5% [2]. The most common
arithmetic operation was clearly addition. Even for the program
with the most arithmetic functions, multiply and divide
were [illegible].


C.C. Church [3] states:


"In instruction occurrence we found arithmetic 8.[illegible] percent
and jumps 12 percent. What are we doing with the rest of
the commands? Obviously, we need the 'Data Move' function,
but do flow charts call for anything near 40 percent?
And what of the transfers? My flow charts do not
call for anything near 23 percent of the problem to be
involved in transferring."


While these types of statistics can be interpreted as
indicating a mismatch between the problem to be solved (i.e.,
the program) and the operations provided (i.e., the machine
instructions), they can also provide insight into the design
and implementation of instruction sets. If the instruction set
provided is of the current machine level form (e.g., IBM 360),
then, for example, the multiply and divide instructions are
not driving design features. If these instructions are less
than 2% in occurrence, then their optimization and reduction
of their execution time by half will save at most 1% of the
overall execution time. On the other hand, an optimization of
branches by half their execution time would make a dramatic
savings in actual execution time.
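
The arithmetic behind this argument can be sketched in a few
lines of Python. This is an illustration only: like the text, it
treats an instruction's frequency of occurrence as its share of
execution time, and the 12 percent figure for jumps is borrowed
from the Church quotation above purely as an example. The
function name is an assumption of this sketch.

```python
def overall_saving(time_fraction, speedup):
    """Fraction of total execution time saved when a class of
    instructions occupying `time_fraction` of the run is made
    `speedup` times faster."""
    return time_fraction * (1.0 - 1.0 / speedup)


# Multiply/divide: about 2% of the work, execution time halved.
mul_div_saving = overall_saving(0.02, 2.0)

# Jumps: 12% of the work (illustrative figure), execution time halved.
jump_saving = overall_saving(0.12, 2.0)

print(f"multiply/divide halved: {mul_div_saving:.1%} saved")
print(f"jumps halved:           {jump_saving:.1%} saved")
```

Under these assumptions, halving the jump time saves six times
as much as halving the multiply and divide time.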


While it is understood that certain data reduction or
filtering problems do require an above normal amount of
multiplication, this is not a common occurrence and the
multiply and divide instructions should not form the basis of
the machine architecture.


If one takes the SUMC JZ (Jump Zero) instruction for
an example (Figure 9.2), it can be seen that not only can it
be made faster, but the number of micro instructions can be
reduced if necessary conditions to be tested are generated by
the hardware. The testing of conditions is indeed the method
of determining control flow through an algorithm, and
therefore will always either have to be in some fashion
artificially produced or explicitly tested. The cost of
providing these extra dynamic conditions is small when
compared to the gains in execution time and savings in
micro-memory.

It can be noted that further savings can be had in the
revised JZ flow chart (Figure 9.3) by either placing the
(MAR) <- (PC) function as a special entrance to the I routine,
since it occurs in several SUMC instructions, or, for the same
reason, having this action as part of a sequence control state.


9.4 Field Manipulation: Masking, Shifting, Bit Addressing and Shifting

The word length of a computer is often chosen because
of arithmetic precision and, contemporaneously, the instruction
format size. Once chosen, this word length then becomes an
artificial quantum of addressability. This is the case with
the SUMC. The word orientation of an instruction set often
rules out the efficient manipulation of variable length fields,
bit manipulation and testing. While the SUMC can accomplish
all these functions at a macro level by using shifting and
logical instructions, it is suggested that if efficiency is to
be obtained, the high frequency of use of these functions in
various instruction architectures requires that they be more
directly under micro level control.

The SUMC does recognize this fact, in a limited way,
by providing in the hardware the extraction of mantissa,
characteristic [...]. The control bits required for the
executive functions of the SUMC are not given special hardware
since they are not known in advance.

What is desired is a generalized bit manipulation,
masking, field insertion and extracting mechanism which can
be micro controlled. In the actual implementation of a
particular instruction set for a particular mission, it is
recognized that this generality could be specialized in order
to optimize the actual usage. An example of the need for
testing of certain bits efficiently would be if it were desired
to implement indirect addressing, and hence the indirect bit of
the operand would have to be efficiently known during the
effective memory address calculation. Besides changes in the
meaning of instruction fields, it could also be possible to
realize other data types or other physical forms of current
data types.
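
The kind of generalized field mechanism meant here can be
sketched as a few primitive operations. The Python below is a
software illustration only, not a description of SUMC hardware:
the 32-bit word, the function names and the instruction layout
(indirect bit, 7-bit op code, 24-bit address) are all assumptions
made for the example.

```python
WORD_BITS = 32  # assumed word length for this sketch


def mask(width):
    """Return a mask of `width` low-order one bits."""
    return (1 << width) - 1


def extract_field(word, pos, width):
    """Return the `width`-bit field whose lowest bit is bit `pos`."""
    return (word >> pos) & mask(width)


def insert_field(word, pos, width, value):
    """Return `word` with the field at (`pos`, `width`) set to `value`."""
    cleared = word & ~(mask(width) << pos)
    return (cleared | ((value & mask(width)) << pos)) & mask(WORD_BITS)


def bit_is_set(word, pos):
    """Test a single bit, e.g. an indirect-addressing flag."""
    return bool((word >> pos) & 1)


# Hypothetical instruction layout: bit 31 = indirect flag,
# bits 24-30 = op code, bits 0-23 = operand address.
INDIRECT_BIT = 31
instr = insert_field(0, 24, 7, 0x45)            # op code
instr = insert_field(instr, 0, 24, 0x001234)    # operand address
instr = insert_field(instr, INDIRECT_BIT, 1, 1) # mark as indirect

print(hex(instr), bit_is_set(instr, INDIRECT_BIT))
```

With position and width supplied as parameters, a change in the
meaning of an instruction field becomes a change of two numbers
rather than a change of mechanism.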


[Figure 9.3: Micro Program Flowchart (Jump Zero), Revised.
Legible branch labels: if (A) = 0, go to Z; if (A) ≠ 0, go to NI.]


9.5 Limited Scratch Pad Addressing

The philosophy of a generalized register set contained
in a scratch pad structure is very good as far as providing an
adaptable design. It would be desired on the hardware level to
allow any register to serve any function. The specification,
therefore, as to the assignment of registers would be contained
within the micro code. The location within the scratch pad of
the macro level program counter, base register, etc., should be
specified by the micro program and not dictated by an arbitrary
hard wired location. One can easily conceive of instruction
sets with automatic base registers or none at all, or with a
return address stack. The present SUMC design does not allow
this generalization.


The internal interconnection of the scratch pad
address register (SPAR) to the Instruction Register (IR) and the
micro memory buffer register indicates that addressing of the
scratch pad is completely [...] specified in advance within
instruction code or micro memory code. The ability to
dynamically deduce or calculate a scratch pad address is not
possible because the SPAR can not be loaded from one of the
SUMC's internal registers, such as the PRR, MQR or MAR. The
dynamic determination of a scratch pad address would be
required if one wished to implement a stack within the
scratch pad.
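
The dependence of a stack on a computed scratch pad address can
be shown with a toy model. The Python sketch below is not the
SUMC data path. The `work` register is a stand-in for an internal
register such as the MAR, and `spar_from_work` models the transfer
that the text says the present design lacks. The cell assignments
are hypothetical.

```python
class Machine:
    """Toy scratch pad with an address register (SPAR)."""

    def __init__(self, size=16):
        self.pad = [0] * size  # scratch pad cells
        self.spar = 0          # scratch pad address register
        self.work = 0          # stand-in for an internal register

    def spar_from_literal(self, address):
        # Address fixed in advance in the instruction or micro word.
        self.spar = address

    def spar_from_work(self):
        # Address computed at run time (the missing path).
        self.spar = self.work

    def read_pad(self):
        self.work = self.pad[self.spar]

    def write_pad(self, value):
        self.pad[self.spar] = value


SP_CELL = 0     # hypothetical: cell 0 holds the stack pointer
STACK_BASE = 8  # hypothetical: the stack starts at cell 8


def push(m, value):
    m.spar_from_literal(SP_CELL)
    m.read_pad()                  # work <- stack pointer
    m.spar_from_work()            # SPAR <- computed address
    m.write_pad(value)
    m.spar_from_literal(SP_CELL)
    m.write_pad(m.work + 1)       # advance the stack pointer


def pop(m):
    m.spar_from_literal(SP_CELL)
    m.read_pad()                  # work <- stack pointer
    m.work -= 1
    m.write_pad(m.work)           # store the decremented pointer
    m.spar_from_work()            # SPAR <- computed address
    m.read_pad()                  # work <- popped value
    return m.work


m = Machine()
m.spar_from_literal(SP_CELL)
m.write_pad(STACK_BASE)
push(m, 111)
push(m, 222)
top = pop(m)
print(top)
```

Every push and pop passes through `spar_from_work`. With only
`spar_from_literal` available, the stack cell to be used would
have to be known when the micro code was written.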


9.6 Micro and Main Memory Speed Ratio

The current T²L version of SUMC operates with a micro
memory cycle time of 330 nanoseconds, while main memory
possesses a 660 nanosecond cycle time. It is suggested that the
speed ratio between micro and main memory should be closer to
5 or 10 to 1 instead of 2 to 1. This becomes especially
desirable when an instruction set is more complex and
semantically powerful than the IBM 360 instruction set. In more
powerful instruction sets, one finds both:

a) an instruction operation specified in fewer bits, and 
hence memory does not have to be read as often, and 

b) the operations to be performed are themselves more 
complex and therefore take more computational steps. 
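
As a worked example of the suggested ratios, the Python fragment
below computes the micro memory cycle time that each ratio would
imply. It takes the main memory cycle time as 660 ns (twice the
330 ns micro memory cycle, consistent with the 2 to 1 ratio
stated above) and assumes that main memory speed stays fixed. The
names are assumptions of this sketch.

```python
MAIN_CYCLE_NS = 660   # main memory cycle time
MICRO_CYCLE_NS = 330  # present micro memory cycle time

current_ratio = MAIN_CYCLE_NS / MICRO_CYCLE_NS


def micro_cycle_for(ratio, main_cycle_ns=MAIN_CYCLE_NS):
    """Micro cycle time (ns) giving `ratio` micro cycles per main cycle."""
    return main_cycle_ns / ratio


print(f"present ratio: {current_ratio:.0f} to 1")
for ratio in (5, 10):
    print(f"{ratio} to 1 needs a micro cycle of "
          f"{micro_cycle_for(ratio):.0f} ns")
```

Put another way, at 10 to 1 the micro program could execute ten
steps in the time of one main memory cycle instead of two.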


9.7 Main Memory Synchronization

While reviewing the micro code flow charts, it was
observed that the processor or micro memory cycle time was
synchronized to the main memory cycle time by executing micro
level NOPs. The main memory cycle time therefore was an
integral part of the micro code. This can be disastrous for two
entirely different reasons.

a) If a slower or faster main memory were employed, many
changes would be required in the actual micro code.

b) In a multiprocessor one can not determine the exact
time between a memory request and the response, since
the addressed memory module might be busy with another
processor and the request might take a number of memory
cycles to be satisfied.

Multiprocessors, therefore, can not guarantee their exact re- 
sponse time with respect to memory. 

What is required is a completely asynchronous operating 
memory interface where the execution of micro code and memory 
timing are not intertwined. 
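
The difference between the two schemes can be illustrated with a
small Python model. It is a sketch under simplifying assumptions:
memory latency is counted in whole micro cycles, and both
function names are hypothetical. It does not represent the SUMC
micro code itself.

```python
def fixed_nops_ok(memory_latency, nops_coded):
    """Micro code padded with a fixed NOP count: the data is valid
    only if the coded delay covers the actual memory latency."""
    return nops_coded >= memory_latency


def wait_for_ready(memory_latency):
    """Asynchronous interface: micro sequencing stalls until the
    memory signals 'ready'. Returns the micro cycles spent waiting."""
    waited = 0
    ready = False
    while not ready:
        waited += 1
        ready = waited >= memory_latency
    return waited


NOPS_CODED = 2  # delay written into the micro code for a 2-cycle memory

# Latencies of 3 and 5 stand for a slower memory or a busy module.
for latency in (2, 3, 5):
    print(latency,
          "fixed NOPs valid:", fixed_nops_ok(latency, NOPS_CODED),
          "| ready-wait cycles:", wait_for_ready(latency))
```

The fixed delay is correct only for the latency it was written
for, while the ready-wait version needs no change when the memory
speed or the contention changes.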

In a multiprocessor environment it is necessary that
a process be able to read the contents of a memory location
and change its value by writing into it, all in one period of
time, to the exclusion of all other processors. This form of
indivisible read/write mechanism must be provided by any
potential multiprocessor.
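
A common form of such a mechanism is a test-and-set operation. It
is named here as an illustration and is not taken from the SUMC
design. The Python sketch below models processors as threads. The
`threading.Lock` inside `MemoryCell` stands for the memory module
serving one processor at a time, and all class and function names
are assumptions of the example.

```python
import threading
import time


class MemoryCell:
    """One memory word with an indivisible read-modify-write cycle."""

    def __init__(self, value=0):
        self.value = value
        self._cycle = threading.Lock()  # one processor per memory cycle

    def test_and_set(self):
        """Return the old value and leave the cell set to 1."""
        with self._cycle:
            old = self.value
            self.value = 1
            return old

    def clear(self):
        with self._cycle:
            self.value = 0


lock_word = MemoryCell()
shared = {"total": 0}


def processor(increments):
    for _ in range(increments):
        while lock_word.test_and_set() == 1:
            time.sleep(0)        # busy-wait, yielding to other threads
        shared["total"] += 1     # critical section
        lock_word.clear()


threads = [threading.Thread(target=processor, args=(500,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["total"])
```

Because the read and the write of `test_and_set` cannot be
separated, only one processor at a time sees the old value 0 and
enters the critical section.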


9.8 Limited Modularity Concept

The "M" in SUMC, which stands for modularity, seems
to extend only to the packaging of arithmetic and register
functions into 4-bit entities. The concept of modularity can be
extended to the higher level of internal architecture by
providing an internal structure which is organized around 1, 2
or 3 buses. These buses allow all the internal structures to
communicate with one another. As needed, new structures
may be added, such as a floating point unit or an associative
memory unit. Most present day mini computers (see Figure 9.4)
are designed around an internal bus structure.

This concept can be extended as in the MLP 900 (IC
9000) which also provides what are called program cards. These 
are hardware modules addressed by micro memory to provide spe- 
cific hardware functions. 

Mini computers such as the HP 2000 series, PDP-11,
MODCOMP I, GRI 909, etc., are all built around an internal bus
structure. Often it is this internal bus structure which
enables the system to expand and contract to meet varying
requirements.

The "M" in SUMC is severely limited with respect to 
this described form of modularity. 

[Figure 9.4: Generic Mini Computer]


9.9 The "U" in SUMC - Ultra Reliability

Reliability clearly requires "good" components. The
SUMC program does attempt to achieve component level
reliability by experimenting with advanced state-of-the-art
component, packaging and fabrication techniques. Reliability is one
of the major design goals of the SUMC architecture. This being 
the case, it is surprising that the architecture of the SUMC 
does not consider hardware detection of the major fault condi- 
tions of integrated circuit implementation. The packaging and 
definition of the modules should consider the effect of failure 
and should attempt to make detectable failures more statisti- 
cally independent. For example, integrated circuit modules 
should tend to be more bit oriented than function oriented. 

It is necessary in reliable systems to have "immedi- 
ate" fault detection within the hardware in order to prevent 
propagation of errors. The interaction of transient faults and 
the micro execution of instructions must be carefully considered, 
and made part of the basic structure. 


9.10 Confusion Between Design Levels


A basic philosophical comment seems appropriate. A 
truly modular design should possess maximum independence be- 
tween design levels. That is, the architecture (block diagram) 
level, instruction definition level, and the implementation 
(logic design, circuit technology) level should be approached 
as independently as possible. A change of definition at one 
level should not cause major impact on the other levels. 

When flexibility is desired the implementation archi- 
tecture should be generalized enough to allow the implementation 
of a wide variety of instruction sets. This is particularly 
true when one considers a long future time frame. While
most current instruction architectures are similar to the 360,
they will become more and more problem oriented, as is the
Burroughs B6700. The instruction set should reflect the major
application for which the system is to be used. For example,
when a Higher Order Language is employed, the instruction set 
should be so specified as to aid in the generation of, and 
hence the efficient execution of, compiled code. 

Similarly, the introduction of new technology at the 
implementation level of design affects speed, weight, power and 
cost, but should have no major impact upon the instruction set 
or (processor, memory, I/O) architecture. 

Clearly one can not be too pedantic in the utilization 
of the principle stated above and must appreciate the practica- 
lities of all design levels. 